Re: Avoiding OOM for large datasets

2013-12-11 Thread Ted Dunning
This is not right.  The sequential version would have finished long before
this for any reasonable value of k.

I do note, however, that you have set k = 200,000 where you only have
300,000 documents.  Depending on which value you set (I don't have the code
handy), this may actually be increased inside the streaming k-means when it
computes the number of sketch centroids, by a factor of roughly 2 log N
\approx 2 * 18.  This gives far more clusters than you have data points,
which is silly.
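
For a rough sense of scale (back-of-the-envelope only, assuming the sketch
grows like k * 2 log_2 N as described above; the exact constant in the code
may differ):

    int n = 300000;                             // number of documents
    int k = 200000;                             // requested clusters
    double log2N = Math.log(n) / Math.log(2);   // about 18.2
    double sketchCentroids = 2 * log2N * k;     // about 7.3 million, far more than n
    System.out.printf("~%.0f sketch centroids for %d points%n", sketchCentroids, n);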

Try again with a more reasonable value of k such as 1000.





On Wed, Dec 11, 2013 at 7:02 AM, Amir Mohammad Saied amirsa...@gmail.com wrote:

 Hi,

 I first tried Streaming K-means with about 5000 news stories, and it worked
 just fine. Then I tried it on 300,000 news stories and gave it 10GB of
 RAM. After more than 43 hours, it was still in the last merge pass when I
 eventually decided to stop it.

 I set K to 20 and KM to 2522308 (it's for detecting similar/related news
 stories). Using these values, is it expected to take so long?

 Cheers,

 amir


 On Thu, Dec 5, 2013 at 3:38 PM, Amir Mohammad Saied amirsa...@gmail.com
 wrote:

  Suneel,
 
  Thanks!
 
  I tried Streaming K-Means, and now I've two naive questions:
 
  1) If I understand correctly, to use the results of streaming k-means I
  need to iterate over all of my vectors again and assign each one to the
  cluster with the closest centroid, right?
 
  2) In clustering news, the number of clusters isn't known beforehand. We
  used to use canopy as a fast approximate clustering technique, but as I
  understand streaming k-means requires K in advance. How can I avoid
  guessing K?
 
  Regards,
 
  Amir
 
 
 
  On Wed, Dec 4, 2013 at 6:27 PM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:
 
  Amir,
 
 
  This has been reported before by several others (and has been my
  experience too). The OOM happens during the Canopy Generation phase of
  Canopy clustering because it only runs with a single reducer.

  If you are using Mahout 0.8 (or trunk), I suggest that you look at the new
  Streaming KMeans clustering, which is quicker and more efficient than the
  traditional Canopy -> KMeans.
 
  See the following link for how to run Streaming KMeans.
 
 
 
 http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means
 
 
 
 
 
 
 
 
 
 
 
  On Wednesday, December 4, 2013 1:19 PM, Amir Mohammad Saied 
  amirsa...@gmail.com wrote:
 
  Hi,
 
  I've been trying to run Mahout (with Hadoop) on our data for quite some
  time now. Everything is fine on relatively small data sets, but when I try
  to do
  K-Means clustering with the aid of Canopy on 300,000 documents, I can't
  even get past the canopy generation because of OOM. We're going to cluster
  similar news, so T1 and T2 are set to 0.84 and 0.6 (those values lead to
  desired results on sample data).
 
  I tried setting both mapred.map.child.java.opts and
  mapred.reduce.child.java.opts to -Xmx4096M, and I also
  exported HADOOP_HEAPSIZE as 4000, but I'm still having issues.
 
  I'm running all of this in Hadoop's single node, pseudo-distributed mode
  on
  a machine with 16GB of RAM.
 
  Searching the Internet for solutions, I found this [1]. One of the bullet
  points states that:
 
  In all of the algorithms, all clusters are retained in memory by
 the
  mappers and reducers
 
  So my question is, does Mahout on Hadoop only help in distributing
  CPU-bound operations? What should one do if they have a large dataset and
  only a handful of low-RAM commodity nodes?
 
  I'm obviously a newbie, thanks for bearing with me.
 
  [1]
 
 
 http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%3c506307eb.3090...@windwardsolutions.com%3E
 
  Cheers,
 
  Amir
 
 
 



Re: Slope one algorithm performance

2013-12-08 Thread Ted Dunning
Use a better recommender.  Slope one is just there for completeness.  



Sent from my iPhone

 On Dec 8, 2013, at 2:24, Siddharth Patnaik spatnai...@gmail.com wrote:
 
 What should be done to improve the runtime performance?


Re: SVM Implementation for mahout?

2013-12-08 Thread Ted Dunning

The problem of correlation of features is clearly present in text, but it is
not so clear what the effect will be. For naive Bayes this has the effect of
making the classifier overconfident, but it usually still works reasonably
well.  For logistic regression without regularization it can cause the learning
algorithm to fail (Mahout's logistic regression is regularized, btw).

Empirical evidence dominates theory in this situation. 

Sent from my iPhone

 On Dec 8, 2013, at 9:14, Fernando Santos fernandoleandro1...@gmail.com 
 wrote:
 
 Now just a theoretical doubt. In a text classification example, what would
 it mean to have features that are highly correlated?  I mean, in this case
 our features are basically words; do you have an example of how these
 features can fail to be independent? This concept is not really clear in my
 mind...


Re: SVM Implementation for mahout?

2013-12-08 Thread Ted Dunning
On Sun, Dec 8, 2013 at 5:50 PM, Fernando Santos 
fernandoleandro1...@gmail.com wrote:

 Actually I had never heard of PCA and LDA. I'll take a look at them.


PCA and LDA are probably not quite what you want for Naive Bayes,
especially in Mahout.  There is an assumption of a sparse binary
representation for data.


Re: Question about Pearson Correlation in non-Taste mode

2013-12-06 Thread Ted Dunning
See

http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf
http://arxiv.org/abs/1207.1847





On Fri, Dec 6, 2013 at 1:09 PM, Amit Nithian anith...@gmail.com wrote:

 Hey Sebastian,

 Thanks again for the explanation. So now you have me intrigued about
 something else. Why is it that logliklihood ratio test is a better measure
 for essentially implicit ratings? Are there resources/research papers you
 can point me to explaining this?

 Take care
 Amit


 On Sun, Dec 1, 2013 at 9:25 AM, Sebastian Schelter
  ssc.o...@googlemail.com wrote:

  Hi Amit,
 
  No need to excuse for picking on me, I'm happy about anyone digging into
  the paper :)
 
   The reason I implemented Pearson in this (flawed) way has to do with
  the way the parallel algorithm works:
 
  It never compares two item vectors in memory, instead it preprocesses
  the vectors and computes sparse dot products in parallel. The centering
  which is usually done for Pearson correlation is dependent on which pair
  of vectors you're currently looking at (and doesn't fit the parallel
  algorithm). We had an earlier implementation that didn't have this flaw,
  but was way slower than the current one.
 
   Rating prediction on explicit feedback data like ratings, for which
   Pearson correlation is mostly used in CF, is a rather academic topic, and
   in science there are nearly no datasets that really require you to go to
   Hadoop.
 
  On the other hand item prediction on implicit feedback data (like
  clicks) is the common scenario in the majority of industry usecases, but
  here count-based similarity measures like the loglikelihood ratio test
  give much better results. The current implementation of Mahout's
  distributed itembased recommender is clearly designed and tuned for the
  latter usecase.
 
  I hope that answers your question.
 
  --sebastian
 
  On 01.12.2013 18:10, Amit Nithian wrote:
   Thanks guys! So the real question is not so much what's the average of
  the
   vector with the missing rating (although yes that was a question) but
   what's the average of the vector with all the ratings specified but the
   second rating that is not shared with the first user:
   [5 - 4] vs [4 5 2].
  
   If we agree that the first is 4.5 then is the second one 11/3 or 3
   ((4+2)/2)? Taste has this as ((4+2)/2) while distributed mode has it as
   11/3.
  
   Since Taste (and Lenskit) is sequential, it can (and will only) look at
   co-occurring ratings whereas the Hadoop implementation doesn't. The
 paper
   that Sebastian wrote has a pre-processing step where (for Pearson) you
   subtract each element of an item-rating vector from the average rating
   which implies that each item-rating vector is treated independently of
  each
   other whereas in the sequential/non-distributed mode it's all
 considered
   together.
  
   My main reason for posting is because the Taste implementation of
  item-item
   similarity differs from the distributed implementation. Since I am
  totally
   new to this space and these similarities I wanted to understand if
 there
  is
   a reason for this difference and whether or not it matters. Sounds like
   from the discussion it doesn't matter but understanding why helps me
   explain this to others.
  
   My guess (and I'm glad Sebastian is on this list so he can help
   confirm/deny this.. sorry I'm not picking on you just happy to be able
 to
    talk to you about your good paper) is that considering co-occurring
  ratings
   in a distributed implementation would require access to the full matrix
   which defeats the parallel nature of computing item-item similarity?
  
   Thanks again!
   Amit
  
  
   On Sun, Dec 1, 2013 at 2:55 AM, Sean Owen sro...@gmail.com wrote:
  
   It's not an issue of how to be careful with sparsity and subtracting
   means, although that's a valuable point in itself. The question is
   what the mean is supposed to be.
  
   You can't think of missing ratings as 0 in general, and the example
   here shows why: you're acting as if most movies are hated. Instead
   they are excluded from the computation entirely.
  
   m_x should be 4.5 in the example here. That's consistent with
   literature and the other implementations earlier in this project.
  
   I don't know the Hadoop implementation well enough, and wasn't sure
   from the comments above, whether it does end up behaving as if it's
   4.5 or 3. If it's not 4.5 I would call that a bug. Items that
   aren't co-rated can't meaningfully be included in this computation.
  
  
   On Sun, Dec 1, 2013 at 8:29 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
   Good point Amit.
  
   Not sure how much this matters.  It may be that
    PearsonCorrelationSimilarity is a bad name that should be
    PearsonInspiredCorrelationSimilarity.  My guess is that this
   implementation
   is lifted directly from the very early recommendation literature and
 is
   reflective of the way that it was used back then.
  
  
 
 



Re: Question about Pearson Correlation in non-Taste mode

2013-12-06 Thread Ted Dunning
The second link was an article I wrote that led eventually to the
dissertation (third link).




On Fri, Dec 6, 2013 at 5:15 PM, Jason Xin jason@sas.com wrote:

 Ted,

  Is the second PDF you attached, "Accurate Methods for the Statistics of
  Surprise and Coincidence", your doctoral dissertation, or do you have
  another one you can forward to me? Thanks.

 Jason Xin

 -Original Message-
 From: Ted Dunning [mailto:ted.dunn...@gmail.com]
 Sent: Friday, December 06, 2013 7:56 PM
 To: user@mahout.apache.org
 Subject: Re: Question about Pearson Correlation in non-Taste mode

 See

 http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
 http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf
 http://arxiv.org/abs/1207.1847





 On Fri, Dec 6, 2013 at 1:09 PM, Amit Nithian anith...@gmail.com wrote:

  Hey Sebastian,
 
  Thanks again for the explanation. So now you have me intrigued about
  something else. Why is it that logliklihood ratio test is a better
  measure for essentially implicit ratings? Are there resources/research
  papers you can point me to explaining this?
 
  Take care
  Amit
 
 
  On Sun, Dec 1, 2013 at 9:25 AM, Sebastian Schelter
  ssc.o...@googlemail.comwrote:
 
   Hi Amit,
  
   No need to excuse for picking on me, I'm happy about anyone digging
   into the paper :)
  
    The reason I implemented Pearson in this (flawed) way has to do
   with the way the parallel algorithm works:
  
   It never compares two item vectors in memory, instead it
   preprocesses the vectors and computes sparse dot products in
   parallel. The centering which is usually done for Pearson
   correlation is dependent on which pair of vectors you're currently
   looking at (and doesn't fit the parallel algorithm). We had an
   earlier implementation that didn't have this flaw, but was way slower
 than the current one.
  
   Rating prediction on explicit feedback data like ratings for which
   Pearson correlation is mostly used in CF, is a rather academic topic
   and in science there are nearly no datasets that really require you
   to go to Hadoop.
  
   On the other hand item prediction on implicit feedback data (like
   clicks) is the common scenario in the majority of industry usecases,
   but here count-based similarity measures like the loglikelihood
   ratio test give much better results. The current implementation of
   Mahout's distributed itembased recommender is clearly designed and
   tuned for the latter usecase.
  
   I hope that answers your question.
  
   --sebastian
  
   On 01.12.2013 18:10, Amit Nithian wrote:
Thanks guys! So the real question is not so much what's the
average of
   the
vector with the missing rating (although yes that was a question)
but what's the average of the vector with all the ratings
specified but the second rating that is not shared with the first
 user:
[5 - 4] vs [4 5 2].
   
If we agree that the first is 4.5 then is the second one 11/3 or 3
((4+2)/2)? Taste has this as ((4+2)/2) while distributed mode has
it as 11/3.
   
Since Taste (and Lenskit) is sequential, it can (and will only)
look at co-occurring ratings whereas the Hadoop implementation
doesn't. The
  paper
that Sebastian wrote has a pre-processing step where (for Pearson)
you subtract each element of an item-rating vector from the
average rating which implies that each item-rating vector is
treated independently of
   each
other whereas in the sequential/non-distributed mode it's all
  considered
together.
   
My main reason for posting is because the Taste implementation of
   item-item
similarity differs from the distributed implementation. Since I am
   totally
new to this space and these similarities I wanted to understand if
  there
   is
a reason for this difference and whether or not it matters. Sounds
like from the discussion it doesn't matter but understanding why
helps me explain this to others.
   
My guess (and I'm glad Sebastian is on this list so he can help
confirm/deny this.. sorry I'm not picking on you just happy to be
able
  to
 talk to you about your good paper) is that considering co-occurring
   ratings
in a distributed implementation would require access to the full
matrix which defeats the parallel nature of computing item-item
 similarity?
   
Thanks again!
Amit
   
   
On Sun, Dec 1, 2013 at 2:55 AM, Sean Owen sro...@gmail.com wrote:
   
It's not an issue of how to be careful with sparsity and
subtracting means, although that's a valuable point in itself.
The question is what the mean is supposed to be.
   
You can't think of missing ratings as 0 in general, and the
example here shows why: you're acting as if most movies are
hated. Instead they are excluded from the computation entirely.
   
m_x should be 4.5 in the example here. That's consistent with
literature and the other implementations

Re: KMeans cluster analysis

2013-12-05 Thread Ted Dunning
Angelo,

The first question is how you intend to define which items are similar.

Also, what is the intended use of the clustering?  Without knowing that, it
is very hard to say how to best do the clustering.

For instance, are two records more similar if the records are at the same
time of day?  Or do you really want to cluster arcs by getting all of the
records for a single arc and finding other arcs which have similar
characteristics in different weather conditions and time of day?

Without some more idea about what is going on, it will not be possible for
you to succeed with clustering, nor for us to help you.



On Thu, Dec 5, 2013 at 3:38 AM, Angelo Immediata angelo...@gmail.com wrote:

 Hi

  First of all I'm sorry if I repeat this question... but it's a pretty old
  one and I really need some help since I'm a real newbie to Mahout and Hadoop.

  I need to do some cluster analysis using some data. At the beginning this
  data may not be very big, but after some time it can be really huge (I did
  some calculations, and after 1 year this data can be around 37 billion
  records). Since I have this huge data, I decided to do the cluster analysis
  by using Mahout on top of Apache Hadoop and its HDFS. Regarding where to
  store this big amount of data I decided to use Apache HBase, always on top
  of the Apache Hadoop HDFS.

  Now I need to do this cluster analysis by considering some environment
  variables. These variables may be the following:

- *recordId* = id of the record
- *arcId *= id of the arc between 2 points of my street graph
- *mediumVelocity *= medium velocity of the considered arc in the
specified
- *vehiclesNumber* = number of the monitored vehicles in order to get
that velocity
- *meteo *= weather condition (a numeric representing if there is sun,
rain etc...)
- *manifestation *= a numeric representing if there is any kind of
manifestation (sport manifestation or other)
- *day of the week*
- *month of the year*
- *hour of the day*
- *vacation *= a numeric representing if it's a vacation day or a
working day

 So my data are so formatted (raw representation):

 *recordId arcId mediumVelocity vehiclesNumber meteo manifestation
 weekDay yearMonth dayHour vacation*
 1 1  34.5201  34
2011   10  3
 2 15666.53 2  51
20086  2

 The clustering should be done by taking care of at least these variables:
 meteo, manifestation, weekDay, dayHour, vacation

  Now, in order to take data from HBase I used the MapReduce functionality
  provided by HBase; basically I wrote this code:

 My MapperReducer class:

 package hadoop.mapred;

 import hadoop.hbase.model.HistoricalDataModel;

 import java.io.IOException;

 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
 import org.apache.hadoop.hbase.client.Result;
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
 import org.apache.hadoop.hbase.mapreduce.TableMapper;
 import org.apache.hadoop.hbase.mapreduce.TableReducer;
 import org.apache.hadoop.hbase.util.Bytes;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.SequenceFile;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.io.Writable;
 import org.apache.hadoop.mapred.join.TupleWritable;

 public class HistoricalDataMapRed {

   public static class HistoricalDataMapper extends TableMapper<Text, TupleWritable> {

     private static final Log logger =
         LogFactory.getLog(HistoricalDataMapper.class.getName());

     private int numRecords = 0;

     @SuppressWarnings({ "unchecked", "rawtypes" })
     protected void map(Text key, Result result,
         org.apache.hadoop.mapreduce.Mapper.Context context)
         throws IOException, InterruptedException {
       try {
         Writable[] vals = new Writable[4];

         IntWritable calFest = new IntWritable(Bytes.toInt(result.getValue(
             HistoricalDataModel.HISTORICAL_DATA_FAMILY,
             HistoricalDataModel.CALENDARIO_FESTIVO)));
         vals[0] = calFest;

         IntWritable calEven = new IntWritable(Bytes.toInt(result.getValue(
             HistoricalDataModel.HISTORICAL_DATA_FAMILY,
             HistoricalDataModel.CALENDARIO_EVENTI)));
         vals[1] = calEven;

         IntWritable meteo = new IntWritable(Bytes.toInt(result.getValue(
             HistoricalDataModel.HISTORICAL_DATA_FAMILY,
             HistoricalDataModel.EVENTO_METEO)));
         vals[2] = meteo;

         IntWritable manifestazione = new IntWritable(Bytes.toInt(result.getValue(
             HistoricalDataModel.HISTORICAL_DATA_FAMILY,
             HistoricalDataModel.MANIFESTAZIONE)));
         vals[3] = manifestazione;

         String chiave = Bytes.toString(result.getRow());
         Text text = new Text();
         text.set(chiave);
         context.write(text, new TupleWritable(vals));

         numRecords++;
         if ((numRecords % 1) == 0) {
           context.setStatus("mapper processed " + numRecords + " records so far");
         }
       } catch (Exception e) {
         String message = "Error in the mapper; error message: " + e.getMessage();

Re: Outlier detection/Pruning

2013-12-05 Thread Ted Dunning
You should move to 0.8 and explore ball k-means.




On Tue, Dec 3, 2013 at 8:44 PM, Prabhakar Srinivasan 
prabhakar.sriniva...@gmail.com wrote:

 Hello
  I am using Mahout 0.7 currently, and this question pertains to that
  version. I am using Canopy clustering (the CanopyDriver class) first to
  determine the optimal number of clusters that best fits the dataset, and
  then passing that information as a parameter to k-means clustering (the
  KMeansDriver class).

 Regards
 Prabhakar


 On Tue, Dec 3, 2013 at 6:00 PM, Ted Dunning ted.dunn...@gmail.com wrote:

  Can you be more specific about which code you are asking about?
 
  The ball k-means implementation provides a capability somewhat like this,
  but perhaps in a more clearly defined way.
 
 
  On Tue, Dec 3, 2013 at 9:34 AM, Prabhakar Srinivasan 
  prabhakar.sriniva...@gmail.com wrote:
 
   Hello!
    Can someone point me to some explanatory documentation for Outlier
    Detection & Removal in Clustering in Mahout? I am unable to understand the
    internal mechanism of outlier detection just by reading the Javadoc:
    "clusterClassificationThreshold is a clustering strictness / outlier
    removal parameter. Its value should be between 0 and 1. Vectors having pdf
    below this value will not be clustered."
  
   What does the pdf represent?
  
   Thanks
   Prabhakar
  
 



Re: TF-IDF confusion

2013-12-03 Thread Ted Dunning
Ani,

I really don't understand your second point.

Here is how I view things ... if you can phrase things in those terms, it
might help me understand your question.

The TF part of TF-IDF refers to the term frequencies in a document.
 Typically, each possible word is assigned a positive integer that
represents a position in a vector.  A term frequency vector is a sparse
vector with counts or functions of counts at locations corresponding to the
words in a document.

If the document has words that do not have assigned positions in the
vector, they are either ignored or the counts are put into a special
UNKNOWN-WORD position.

By definition, there is no way that the term frequency vector can be too
long or too short.  Likewise, a document's length only matters if the counts
get too large to store (completely implausible for this to happen since we
use a double).

The IDF part of TF-IDF refers to weights that are applied to these TF
vectors.  These weights are conventionally computed using the log of the
inverse of the fraction of documents which contain the corresponding word.
The IDF weighting has one weight for each position in the term frequency
vector and thus length is again not a problem.

This is why I don't understand your second point.
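
Here is a tiny illustration of how I view those pieces (a sketch only, with a
made-up dictionary and weights; this is not Mahout's vectorizer code):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    public class TfIdfSketch {
      public static void main(String[] args) {
        // Dictionary built ahead of time from the corpus: word -> position in the vector.
        Map<String, Integer> dictionary = new HashMap<String, Integer>();
        dictionary.put("mahout", 0);
        dictionary.put("cluster", 1);
        dictionary.put("news", 2);

        // IDF weights, one per position, e.g. log(N / df) computed over the corpus.
        double[] idf = {2.3, 1.1, 0.7};

        // TF vector for one document: counts at the positions of its known words.
        String[] doc = {"news", "news", "cluster", "unknownword"};
        double[] tf = new double[dictionary.size()];
        for (String word : doc) {
          Integer pos = dictionary.get(word);
          if (pos != null) {
            tf[pos] += 1;          // words with no assigned position are simply ignored
          }
        }

        // TF-IDF is an element-wise product; the vector never changes length.
        double[] tfidf = new double[tf.length];
        for (int i = 0; i < tf.length; i++) {
          tfidf[i] = tf[i] * idf[i];
        }
        System.out.println(Arrays.toString(tfidf));   // [0.0, 1.1, 1.4]
      }
    }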

Is it that you mean that many of the words in the document do not have
assigned positions in the term frequency vector?  If so, that means that
you didn't analyze the corpus ahead of time to get a good dictionary
of word locations.

Or is it that you are worried that the counts would be large?




On Tue, Dec 3, 2013 at 7:03 AM, Ani Tumanyan a...@bnotions.com wrote:

 Hello everyone,

 I'm working on a project, where I'm trying to extract topics from news
 articles. I have around 500,000 articles as a dataset. Here are the steps
 that I'm following:

 1. First of all I'm doing some sort of preprocessing. For this I'm using
 Behemoth to annotate the document and get rid of non-English documents,
  2. Then I'm running Mahout's sparse vector command to generate TF-IDF
  vectors. The problem with the TF-IDF vector is that the number of words for
  a document is far more than the number of words in the TF vectors. Moreover,
  there are some words/terms in the TF-IDF vector that didn't appear in that
  specific document at all. Is this correct behaviour, or is there something
  wrong with my approach?

 Thanks in advance!

 Ani


Re: Outlier detection/Pruning

2013-12-03 Thread Ted Dunning
Can you be more specific about which code you are asking about?

The ball k-means implementation provides a capability somewhat like this,
but perhaps in a more clearly defined way.


On Tue, Dec 3, 2013 at 9:34 AM, Prabhakar Srinivasan 
prabhakar.sriniva...@gmail.com wrote:

 Hello!
  Can someone point me to some explanatory documentation for Outlier
  Detection & Removal in Clustering in Mahout? I am unable to understand the
  internal mechanism of outlier detection just by reading the Javadoc:
  "clusterClassificationThreshold is a clustering strictness / outlier removal
  parameter. Its value should be between 0 and 1. Vectors having pdf below
  this value will not be clustered."

 What does the pdf represent?

 Thanks
 Prabhakar



Re: Clustering Spatial Data

2013-12-02 Thread Ted Dunning
Peter,

What you say is a bit confusing to me.

You say you have centers already.  But then you talk about algorithms which
find the centers.

Also, you say you want to assign points based on centers, but you also say
that clusters have different shapes, area, size and point count.  Do you
mean that assignment should be purely based on proximity to the center and
that the shape will be whatever it happens to be as a result?

Or do you mean that there is an a priori known shape that has to be taken
into account during point assignment?

If proximity is the only question, and if you can use great circle distance
as your proximity measure, then this problem is fairly easy and can be
handled in just a few lines of code.  One easy way to handle this is to
convert your centers to normalized x, y, z locations using

   x = cos \lambda cos \phi
   y = cos \lambda sin \phi
   z = sin \lambda

where \lambda is the latitude and \phi is the longitude.  Great circle
distance is monotonically related to Euclidean distance in 3-space and thus
is inversely monotonically related to the dot product.

This means you can sort the centers by distance to a point by simply
computing x,y,z for the point and then doing the dot product and sorting in
descending order. The nice thing with this is that there are no trig
functions inside your inner loop.
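
In code, the assignment loop is just this (a sketch in plain Java, assuming
latitude and longitude are already in radians and that the centers have been
converted to unit x, y, z the same way):

    static double[] toUnitXyz(double lat, double lon) {
      // lat = \lambda, lon = \phi in the formulas above
      return new double[] {
          Math.cos(lat) * Math.cos(lon),   // x
          Math.cos(lat) * Math.sin(lon),   // y
          Math.sin(lat)                    // z
      };
    }

    static int nearestCenter(double[] p, double[][] centers) {
      // A larger dot product means a smaller great circle distance, so keep the max.
      int best = -1;
      double bestDot = Double.NEGATIVE_INFINITY;
      for (int i = 0; i < centers.length; i++) {
        double dot = p[0] * centers[i][0] + p[1] * centers[i][1] + p[2] * centers[i][2];
        if (dot > bestDot) {
          bestDot = dot;
          best = i;
        }
      }
      return best;
    }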

You can also use the haversine formula, but that requires 3-4 trig
functions in the inner loop and is likely to be slower.

You don't really need Mahout for this at all (unless I completely
misunderstand your problem, which is quite possible).



On Mon, Dec 2, 2013 at 1:31 AM, Peter K peat...@yahoo.de wrote:

 Hi there,

  I have no experience with Mahout but I know that it will solve my
  problem :)!

 I've the following requirements:

  * No hadoop setup should be necessary. I want a simple approach and I
 know this is possible with mahout!
  * I have lots of points (~100 million) but also some RAM (32GB)
  * I know the clusters upfront via its center positions.
  * I need to assign every point to exactly one cluster.
  * Every cluster can have a different shape, area size and point count

 I've found:
 http://en.wikipedia.org/wiki/OPTICS_algorithm
 http://en.wikipedia.org/wiki/DBSCAN

  Neither algorithm really pays attention to the fixed cluster centers, but
  I think I will start there. Is one of them implemented in Mahout?

 Or do you have another idea or hint/link?

 Regards,
 Peter.



Re: Pig vector project

2013-12-02 Thread Ted Dunning
Elephant bird is distinctly superior to Pig Vector for many things (it
moved forward, Pig Vector did not).

I believe there is also a Twitter-internal project known as PigML which is
much more what Pig Vector wanted to be.

There is also https://github.com/hanborq/pigml, but I think it is very
different.

You might ping @pbrane (Jake Mannix jake.man...@gmail.com) or @lintool
(Jimmy Lin ji...@twitter.com) to see if they have anything to say on the
topic.



On Mon, Dec 2, 2013 at 4:14 PM, Andrew Musselman andrew.mussel...@gmail.com
 wrote:

 You might also look into elephant-bird from Twitter; covers a lot of
 ground.

 https://github.com/kevinweil/elephant-bird


 On Mon, Dec 2, 2013 at 4:10 PM, Sameer Tilak ssti...@live.com wrote:

 
 
 
   Hi All,

   We are using Pig to build our data pipeline.
   I came across the following: https://github.com/tdunning/pig-vector
   The last commit was 2 yrs ago. Any information on whether there will be
   any further work on this project?
 
 



Re: Mahout for clustering

2013-12-02 Thread Ted Dunning
Do you want to cluster users or items?

For items, the vectorization that you suggest will work reasonably well,
especially if you use TF.IDF weighting and normalize the resulting vectors.

You can also use one of the matrix decomposition techniques and cluster the
resulting vectors.  The spectral clustering system that is part of Mahout
will do all of this in one step.  SVD + streaming k-means + ball k-means
should also work well.
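
As a rough sketch of that kind of item vectorization (an illustration only,
using Mahout's math vectors; the item-to-column dictionary and the weights
here are made-up placeholders, not anything from your pipeline):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class ItemVectorSketch {
      public static void main(String[] args) {
        // Hypothetical dictionary: alphanumeric item code -> column index.
        Map<String, Integer> itemIndex = new HashMap<String, Integer>();
        itemIndex.put("A1", 0);
        itemIndex.put("A2", 1);
        itemIndex.put("B7", 2);

        // Hypothetical IDF-style weight per column (rarer items get larger weights).
        double[] weight = {0.2, 1.5, 2.0};

        // One user's viewed items become a sparse, weighted, normalized vector.
        String[] viewed = {"A2", "B7"};
        Vector v = new RandomAccessSparseVector(itemIndex.size());
        for (String code : viewed) {
          Integer col = itemIndex.get(code);
          if (col != null) {
            v.set(col, weight[col]);
          }
        }
        Vector normalized = v.normalize();   // unit length, ready for clustering
        System.out.println(normalized);
      }
    }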





On Mon, Dec 2, 2013 at 4:22 PM, Sameer Tilak ssti...@live.com wrote:




  Hi All,

  We are using Apache Pig for building our data pipeline. We have data in the
  following fashion:

  userid, age, items {code 1, code 2, ….}, few other features...

  Each item has a unique alphanumeric code.  I would like to use Mahout for
  clustering it.  Based on my current reading I see the following options:

  1. Map each alphanumeric item code to a numeric code -- A1 -> 0,
  A2 -> 1, A2 -> 2, etc. Then run the clustering algorithm on the
  reformatted data and then map the results back onto the real item codes.

  2. Represent the info on item codes as a 1 x M matrix where a column
  represents an item (1 if a given user has viewed a particular item, 0
  otherwise); it will have millions of columns. So each user will have an id,
  age, and this matrix. Not sure if this will work…

  We also want to do frequency pattern mining etc. on the same data. Any
  thoughts on data representation and clustering will be great.




Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-12-02 Thread Ted Dunning
Inline


On Mon, Dec 2, 2013 at 8:55 AM, optimusfan optimus...@yahoo.com wrote:

 ... To accomplish this, we used AdaptiveLogisticRegression and trained 46
 binary classification models.  Our approach has been to do an 80/20 split
 on the data, holding the 20% back for cross-validation of the models we
 generate.


Sounds reasonable.


 We've been playing around with a number of different parameters, feature
 selection, etc. and are able to achieve pretty good results in
 cross-validation.


When you say cross validation, do you mean the magic cross validation that
the ALR uses?  Or do you mean your 20%?


  We have a ton of different metrics we're tracking on the results, most
 significant to this discussion is that it looks like we're achieving very
 good precision (typically .85 or .9) and a good f1-score (typically again
 .85 or .9).


These are extremely good results.   In fact, they are good enough that I
would start thinking about a target leak.

 However, when we then take the models generated and try to apply them to
 some new documents, we're getting many more false positives than we would
 expect.  Documents that should have 2 categories are testing positive for
 16, which is well above what I'd expect.  By my math I should expect 2 true
 positives, plus maybe 4.4 (.10 false positives * 44 classes) additional
 false positives.


You said documents.  Where do these documents come from?

One way to get results just like you describe is if you train on raw news
wire that is split randomly between training and test.  What can happen is
that stories that get edited and republished have a high chance of getting
at least one version in both training and test.  This means that the
supposedly independent test set actually has significant overlap with the
training set.  If your classifier over-fits, then the test set doesn't
catch the problem.

Another way to get this sort of problem is if you do your training/test
randomly, but the new documents come from a later time.  If your classifier
is a good classifier, but is highly specific to documents from a particular
moment in time, then your test performance will be a realistic estimate of
performance for contemporaneous documents but will be much higher than
performance on documents from a later point in time.

A third option could happen if your training and test sets were somehow
scrubbed of poorly structured and invalid documents.  This often happens.
 Then, in the real system, if the scrubbing is not done, the classifier may
fail because the new documents are not scrubbed in the same way as the
training documents.

These are just a few of the ways that *I* have screwed up building
classifiers.  I am sure that there are more.

We suspected that perhaps our models were underfitting or overfitting,
 hence this post.  However, I'll take any and all suggestions for anything
 else we should be looking at.


Well, I think that, almost by definition, you have an overfitting problem
of some kind.  The question is what kind.  The only thing that I think
you don't have is a frank target leak in your documents.  That would
(probably) have given you even higher scores on your test case.


Re: Question about Pearson Correlation in non-Taste mode

2013-12-01 Thread Ted Dunning
Good point Amit.

Not sure how much this matters.  It may be that
PearsonCorrelationSimilarity is a bad name that should be
PearsonInspiredCorrelationSimilarity.  My guess is that this implementation
is lifted directly from the very early recommendation literature and is
reflective of the way that it was used back then.

Remember that the context here is prediction of ratings.  If you assume
that you really want correlation and that missing elements are zero, then
this is mathematically wrong.  On the other hand, if you assume missing
elements are equal to the mean (whatever it is), then this definition is
correct.

In any case, I don't think that PearsonCorrelationSimilarity should be
fixed at this point.  First of all, a substantial change here is somewhat
risky since there may be people who depend on current behavior.  Second, I
think that this is almost never a particularly good recommendation
algorithm so even if the proposed change is a small improvement, it will
have negligible positive effect on the universe of production recommenders.

Remember that this function is not a stats routine.  It is an embodiment of
recommendation practice.  Were it the former, I would strongly recommend we
fix it.






On Sat, Nov 30, 2013 at 10:18 AM, Amit Nithian anith...@gmail.com wrote:

 Hi Ted,

 Thanks that is what I would have thought too but I don't think that the
 Pearson Similarity (in Hadoop mode) does this:

 in

 org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.PearsonCorrelationSimilarity
 around line 31

 double average = vector.norm(1) / vector.getNumNonZeroElements();
 Which looks like it's taking the sum and dividing by the number of defined
 elements. Which would make my [5 - 4] average be 4.5.

 Thanks again
 Amit

 On Fri, Nov 29, 2013 at 10:34 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  On Fri, Nov 29, 2013 at 10:16 PM, Amit Nithian anith...@gmail.com
 wrote:
 
   Hi Ted,
  
   Thanks for your response. I thought that the mean of a sparse vector is
   simply the mean of the defined elements? Why would the vectors become
   dense unless you're meaning that all the undefined elements (0?) now
 will
   be (0-m_x)?
  
 
   Yes.  Just so.  All those zero elements become non-zero and the vector is
   thus no longer sparse.
 
 
  
   Looking at the following example:
   X = [5 - 4] and Y= [4 5 2].
  
   is m_x 4.5 or 3?
 
 
  3.
 
  This is because the elements of X are really 5, 0, and 4.  The zero is
 just
  not stored, but it still is the value of that element.
 
 
    Is m_y 11/3 or (6/2) because we ignore the 5 since its
    counterpart in X is undefined?
  
 
  11/3
 



Re: Test naivebayes task running really slowly and not in distributed mode

2013-12-01 Thread Ted Dunning
Did the training run use both machines?

How large is the input for the test run?

Is it contained in a single file?




On Sat, Nov 30, 2013 at 11:22 AM, Fernando Santos 
fernandoleandro1...@gmail.com wrote:

 Hello everyone,

  I'm trying to do a text classification task. My dataset is not that big; I
  have around 700,000 small comments.

  Following the 20newsgroups example, I created the vectors from the text,
  split them, and trained the model. Now I'm trying to test it, but it is
  really slow and I also cannot make it run on the cluster. Whatever I do,
  it always just runs on one machine. And I think the testnb algorithm is
  supposed to run using MapReduce, right?

 I also tried this example here (

 http://chimpler.wordpress.com/2013/06/24/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages-part-2-distribute-classification-with-hadoop/
 )
  but there too, the other box in the cluster is not executing any task. In
  fact, when I execute testnb or use the MapReduceClassifier proposed in the
  tutorial above, I get one job executing one task, and this task runs really
  slowly (like 6 minutes to reach 0.13% of the task).

  I think I must be doing something wrong, so that the cluster is not working
  the way it is supposed to.

  I have a cluster with 2 boxes configured with Hadoop 0.20.205.0, and I am
  using Mahout 0.8.

 I also tried versions 0.7 and 0.6 of mahout but nothing changed.

  Any help would be appreciated.


 The logs I have from this task:


 *stdout logs*

 Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library
 /usr/local/hadoop/lib/libhadoop.so which might have disabled stack
 guard. The VM will try to fix the stack guard now.
 It's highly recommended that you fix the library with 'execstack -c
 libfile', or link it with '-z noexecstack'.


 *syslog logs*

 2013-11-30 17:09:19,191 WARN org.apache.hadoop.util.NativeCodeLoader:
 Unable to load native-hadoop library for your platform... using
 builtin-java classes where applicable
 2013-11-30 17:09:19,400 WARN
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi
 already exists!
 2013-11-30 17:09:19,472 INFO org.apache.hadoop.util.ProcessTree:
 setsid exited with exit code 0
 2013-11-30 17:09:19,474 INFO org.apache.hadoop.mapred.Task:  Using
 ResourceCalculatorPlugin :
 org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5810d963
 2013-11-30 17:09:19,543 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb
 = 100
 2013-11-30 17:09:19,569 INFO org.apache.hadoop.mapred.MapTask: data
 buffer = 79691776/99614720
 2013-11-30 17:09:19,569 INFO org.apache.hadoop.mapred.MapTask: record
 buffer = 262144/327680





 --
 Fernando Santos
 +55 61 8129 8505



Re: Clustering without Hadoop

2013-12-01 Thread Ted Dunning
The new Ball k-means and streaming k-means implementations have non-Hadoop
versions.  The streaming k-means implementation also has a threaded
implementation that runs without Hadoop.

The threaded streaming k-means implementation should be pretty fast.



On Sun, Dec 1, 2013 at 7:55 PM, Shan Lu shanlu...@gmail.com wrote:

 Thanks, Suneel, I'll try this way.

 In this recommender example:

 https://github.com/ManuelB/facebook-recommender-demo/blob/master/src/main/java/de/apaxo/bedcon/AnimalFoodRecommender.java#L142
 ,

 they only use mahout api. So I am thinking that can I do the clustering
 similarly.


 On Sun, Dec 1, 2013 at 10:42 PM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:

  Shan,
 
   All of Mahout's implementations use the Hadoop API, but if you are trying
   to run kmeans in sequential (non-MapReduce) mode, pass in runSequential =
   true instead of false as the last parameter to KMeansDriver.run(), or run
   them in LOCAL_MODE as pointed out earlier by Amit.
 
 
 
 
 
 
 
  On Sunday, December 1, 2013 10:28 PM, Shan Lu shanlu...@gmail.com
 wrote:
 
  Thanks for your reply. In the example code, they run the k-means
 algorithm
  using org.apache.hadoop.conf.Configuration,
  org.apache.hadoop.fs.FileSystem, and org.apache.hadoop.fs.Path
 parameters.
   Is there any algorithm that doesn't need any Configuration and Path
   parameters and just uses the data in memory? I mean, can I run the k-means
   algorithm without using the Hadoop API, just using Java? Thanks.
 
 
  On Sun, Dec 1, 2013 at 9:58 PM, Amit Nithian anith...@gmail.com wrote:
 
   When you say without hadoop does that include local mode? You can run
  these
   examples in local mode that doesn't require a cluster for testing and
   poking around. Everything then runs in a single jvm.
   On Dec 1, 2013 9:18 PM, Shan Lu shanlu...@gmail.com wrote:
  
Hi,
   
I am working on a very simple k-means clustering example. Is there a
  way
   to
run clustering algorithms in mahout without using Hadoop? I am
 reading
   the
book Mahout in Action. In chapter 7, the hello world clustering
 code
example, they use
==
   
 KMeansDriver.run(conf, new Path("testdata/points"),
     new Path("testdata/clusters"), new Path("output"),
     new EuclideanDistanceMeasure(), 0.001, 10, true, false);
   
==
to run the k-means algorithm. How can I run the k-means algorithm
  without
Hadoop?
   
Thanks!
 
   
Shan
   
  
 
 
 
  --
  Shan Lu
  ECE Dept., NEU, Boston, MA 02115
 



 --
 Shan Lu
 ECE Dept., NEU, Boston, MA 02115



Re: RandomAccessSparseVector setting 1.0 in 2 dims for 1 feature value?

2013-11-29 Thread Ted Dunning
The default with the Mahout encoders is two probes.  This is unnecessary
with the intercept term, of course, if you protect the intercept term from
other updates, possible by encoding other data using a view of the original
feature vector.

For each probe, a different hash is used so each value is put into multiple
locations.  Multiple probes are useful in general to decrease the effect of
the reduced dimensionality of the hashed representation.



On Fri, Nov 29, 2013 at 1:14 AM, Paul van Hoven paul.van.ho...@gmail.com wrote:

 For an example program using mahout I use the donut.csv sample data
 from the project (

 https://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/resources/donut.csv
 ). My code looks like this:

 import java.util.ArrayList;

 import org.apache.mahout.math.RandomAccessSparseVector;
 import org.apache.mahout.math.Vector;
 import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
 import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
 import com.csvreader.CsvReader;

 public class Runner {

   // Set the path accordingly!
   public static final String csvInputDataPath = "/path/to/donut.csv";

   public static void main(String[] args) {

     FeatureVectorEncoder encoder = new StaticWordValueEncoder("features");
     ArrayList<RandomAccessSparseVector> featureVectors =
         new ArrayList<RandomAccessSparseVector>();
     try {
       CsvReader csvReader = new CsvReader(csvInputDataPath);
       csvReader.readHeaders();
       while (csvReader.readRecord()) {
         Vector featureVector = new RandomAccessSparseVector(30);
         featureVector.set(0, new Double(csvReader.get("x")));
         featureVector.set(1, new Double(csvReader.get("y")));
         featureVector.set(2, new Double(csvReader.get("c")));
         featureVector.set(3, new Integer(csvReader.get("color")));
         System.out.println("Before: " + featureVector.toString());
         encoder.addToVector(csvReader.get("shape").getBytes(), featureVector);
         System.out.println(" After: " + featureVector.toString());
         featureVectors.add((RandomAccessSparseVector) featureVector);
       }
     } catch (Exception e) {
       e.printStackTrace();
     }

     System.out.println("Program is done.");
   }

 }


 What confuses me is the following output (one sample):

 Before:
 {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0}
  After:
 {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0,29:1.0,25:1.0}

 As you can see, I added just one value ("shape") to the vector. However
 two dimensions of this vector are encoded with 1.0. On the other hand,
 for some other data I get the output

 Before:
 {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:2.0}
  After:
 {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:3.0,16:1.0}

 Why? I would expect that _always_ only one dimension gets occupied by
 1.0, as this is the standard case for categorical encoding. So this
 seems to be wrong.

 Thanks in advance,
 Paul



Re: RandomAccessSparseVector setting 1.0 in 2 dims for 1 feature value?

2013-11-29 Thread Ted Dunning
If you always insert 1's for each element, then you can detect collisions
by inserting all your elements (or all elements in each document
separately) and looking for the max value in the vector.  If you see
something greater than 1, you have a collision.

But collisions are actually good.  The only way to completely avoid them is
to use a vector as large as your vocabulary which is often painfully large.

You can also view multiple probes not so much as avoiding collisions, but
as making the linear transformation from the very large dimensional
representation of one dimension per word to the lower hashed representation
more likely to be nearly invertible in the sense that the Euclidean metric
will be approximately preserved.  Think Johnson-Lindenstrauss random
projections.
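
If you want to check explicitly, something like this works (a sketch only; it
assumes the encoder contributes a weight of 1.0 per probe for values that are
not in a weight dictionary, which is the default for the static encoder as far
as I recall):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class CollisionCheck {
      public static void main(String[] args) {
        FeatureVectorEncoder encoder = new StaticWordValueEncoder("features");
        Vector check = new RandomAccessSparseVector(30);
        String[] values = {"circle", "square", "triangle", "xy", "banana"};
        for (String value : values) {
          // add 1.0 at each hashed location (two probes by default)
          encoder.addToVector(value.getBytes(), 1.0, check);
        }
        if (check.maxValue() > 1.0) {
          System.out.println("collision: some hashed location was hit more than once");
        } else {
          System.out.println("no collisions for these values at this vector size");
        }
      }
    }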



On Fri, Nov 29, 2013 at 1:54 AM, Paul van Hoven paul.van.ho...@gmail.com wrote:

 Hi, thanks for your quick reply. So multiple probes are a protection
 against collisions? After playing a little with the default length of
 a RandomAccessSparseVector object I noticed that (of course)
 collisions occur when the length is too short. Therefore, I'm asking
 myself if there is a possibility to check if a collision occurred
 after encoding a new value in the vector? This would give a user the
 information that the length of the chosen vector is too short. So far,
 I did not find any method in the api to check for that.

 2013/11/29 Ted Dunning ted.dunn...@gmail.com:
  The default with the Mahout encoders is two probes.  This is unnecessary
  with the intercept term, of course, if you protect the intercept term
 from
  other updates, possible by encoding other data using a view of the
 original
  feature vector.
 
  For each probe, a different hash is used so each value is put into
 multiple
  locations.  Multiple probes are useful in general to decrease the effect
 of
  the reduced dimensionality of the hashed representation.
 
 
 
  On Fri, Nov 29, 2013 at 1:14 AM, Paul van Hoven 
  paul.van.ho...@gmail.com wrote:
 
  For an example program using mahout I use the donut.csv sample data
  from the project (
 
 
 https://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/resources/donut.csv
  ). My code looks like this:
 
   import java.util.ArrayList;

   import org.apache.mahout.math.RandomAccessSparseVector;
   import org.apache.mahout.math.Vector;
   import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
   import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
   import com.csvreader.CsvReader;

   public class Runner {

     // Set the path accordingly!
     public static final String csvInputDataPath = "/path/to/donut.csv";

     public static void main(String[] args) {

       FeatureVectorEncoder encoder = new StaticWordValueEncoder("features");
       ArrayList<RandomAccessSparseVector> featureVectors =
           new ArrayList<RandomAccessSparseVector>();
       try {
         CsvReader csvReader = new CsvReader(csvInputDataPath);
         csvReader.readHeaders();
         while (csvReader.readRecord()) {
           Vector featureVector = new RandomAccessSparseVector(30);
           featureVector.set(0, new Double(csvReader.get("x")));
           featureVector.set(1, new Double(csvReader.get("y")));
           featureVector.set(2, new Double(csvReader.get("c")));
           featureVector.set(3, new Integer(csvReader.get("color")));
           System.out.println("Before: " + featureVector.toString());
           encoder.addToVector(csvReader.get("shape").getBytes(), featureVector);
           System.out.println(" After: " + featureVector.toString());
           featureVectors.add((RandomAccessSparseVector) featureVector);
         }
       } catch (Exception e) {
         e.printStackTrace();
       }

       System.out.println("Program is done.");
     }

   }
  }
 
 
  What confuses me is the following output (one sample):
 
  Before:
  {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0}
   After:
 
 {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0,29:1.0,25:1.0}
 
   As you can see, I added just one value ("shape") to the vector. However
   two dimensions of this vector are encoded with 1.0. On the other hand,
   for some other data I get the output
 
  Before:
  {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:2.0}
   After:
 
 {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:3.0,16:1.0}
 
   Why? I would expect that _always_ only one dimension gets occupied by
   1.0, as this is the standard case for categorical encoding. So this
   seems to be wrong.
 
  Thanks in advance,
  Paul
 



Re: Question about Pearson Correlation in non-Taste mode

2013-11-29 Thread Ted Dunning
Well, the best way to compute correlation using sparse vectors is to make
sure you keep them sparse.  To do that, you must avoid subtracting the mean
by expanding whatever formulae you are using.  For instance, if you are
computing

(x - m_x) . (y - m_y)

(here . means dot product)

If you do this directly, then you lose all benefit of sparse vectors since
subtracting the means makes each vector dense.

What you should compute instead is this alternative form

   x . y - m_x e . y - m_y e . x + n m_x m_y

(here e represents a vector full of 1's and n = e . e is the vector length)

The dot product here is sparse and the expression m_x e . y can be computed
(at least in Mahout) in map-reduce idiom as

y.aggregate(Functions.PLUS, Functions.mult(m_x))
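
Here is what that looks like end to end with the Mahout vector classes (a
small sketch using the [5 0 4] and [4 5 2] example from this thread; the
means here treat missing entries as zero):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.function.Functions;

    public class SparseCenteredDot {
      public static void main(String[] args) {
        Vector x = new RandomAccessSparseVector(3);
        x.set(0, 5); x.set(2, 4);                  // [5 0 4]
        Vector y = new RandomAccessSparseVector(3);
        y.set(0, 4); y.set(1, 5); y.set(2, 2);     // [4 5 2]

        double n = x.size();
        double mx = x.zSum() / n;                  // 3
        double my = y.zSum() / n;                  // 11/3

        // (x - mx e) . (y - my e) = x.y - mx (e.y) - my (e.x) + n mx my,
        // computed without ever densifying x or y.
        double mxEy = y.aggregate(Functions.PLUS, Functions.mult(mx));   // mx * (e . y)
        double myEx = x.aggregate(Functions.PLUS, Functions.mult(my));   // my * (e . x)
        double centeredDot = x.dot(y) - mxEy - myEx + n * mx * my;

        System.out.println(centeredDot);   // -5.0, same as centering first
      }
    }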




On Fri, Nov 29, 2013 at 9:31 PM, Amit Nithian anith...@gmail.com wrote:

 Okay so I rethought my question and realized that the paper never really
 talked about collaborative filtering but just how to calculate item-item
 similarity in a scalable fashion. Perhaps this is the reason for why the
 common ratings aren't used? Because that's not a pre-req for this
 calculation?

 Although for my own clarity, I'd still like to get a better understanding
 of what it means to calculate the correlation between sparse vectors where
 you're normalizing each vector using a separate denominator.

 P.S. If my question(s) don't make sense please let me know for it's very
 possible I am completely misunderstanding something :-).

 Thanks again!
 Amit


 On Wed, Nov 27, 2013 at 8:23 AM, Amit Nithian anith...@gmail.com wrote:

  Hey Sebastian,
 
  Thanks again. Actually I'm glad that I am talking to you as it's your
  paper and presentation I have questions with! :-)
 
  So to clarify my question further, looking at this presentation (
  http://isabel-drost.de/hadoop/slides/collabMahout.pdf) you have the
  following user x item matrix:
  M   A   I
  A  51   4
  B  -25
  P  4   32
 
  If I want to calculate the pearson correlation between Matrix and
  Inception, I'd have the rating vectors:
  [5 - 4] vs [4 5 2].
 
   One of the steps in your paper is the normalization step which subtracts
   the mean item rating from each value and essentially takes the L2 norm of
   this resulting vector (or in other words, the L2 norm of the mean-centered
   vector?)
 
  The question I have had is what is the average rating for Matrix and
  Inception? I can see the following:
  Matrix - 4.5 (9/2), Inception - 3 (6/2) because you only consider shared
  ratings
  Matrix - 3 (9/3), Inception - 3.667 (11/3) assuming that the missing
  rating is 0
  Matrix - 4.5 (9/2), Inception - 3.667 (11/3) subtract from the average of
  all non-zero ratings == This is what I believe the current
 implementation
  does.
 
  Unfortunately, neither of these yield the 0.47 listed in the presentation
  but that's a separate issue. In my testing, I see that Mahout Taste
  (non-distributed) uses the 1st approach while the distributed approach
 uses
  the 3rd approach.
 
  I am okay with #3; however I just want to understand that this is the
 case
  and that it's okay. This is why I was asking about pearson correlation
  between vectors of different lengths because the average rating is
 being
  computed using a denominator (number of users) that is different between
  the two (2 vs 3).
 
  I know you said in practice that people don't use Pearson to compute
  inferred ratings but this is just for my complete understanding (and
 since
  it's the example used in your presentation). This same question applies
 to
  cosine as you are doing an L2-Norm of the vector as a pre-processing step
  and including/excluding non-shared ratings may make a difference.
 
  Thanks again!
  Amit
 
 
  On Wed, Nov 27, 2013 at 7:13 AM, Sebastian Schelter 
  ssc.o...@googlemail.com wrote:
 
  Hi Amit,
 
  Yes, it gives different results. However in practice, most people don't
  do rating prediction with Pearson coefficient, but use count-based
  measures like the loglikelihood ratio test.
 
  The distributed code doesn't look at vectors of different lengths, but
  simply assumes non-existent ratings as zero.
 
  --sebastian
 
  On 27.11.2013 16:09, Amit Nithian wrote:
   Comparing this against the non distributed (taste) gives different
  answers
   for item item similarity as of course the non distributed looks only
 at
   corated items. I was more wondering if this difference in practice
  mattered
   or not.
  
   Also I'm confused on how you can compute the Pearson similarity
 between
  two
   vectors of different length which essentially is going on here I
 think?
  
   Thanks again
   Amit
   On Nov 27, 2013 9:06 AM, Sebastian Schelter 
 ssc.o...@googlemail.com
   wrote:
  
   Yes, it is due to the parallel algorithm which only looks at
 co-ratings
   from a given user.
  
  
   On 27.11.2013 15:02, Amit Nithian wrote:
   Thanks Sebastian! Is there a particular reason for that?
   On Nov 27, 2013 7:47 AM, Sebastian Schelter 
  

Re: Question about Pearson Correlation in non-Taste mode

2013-11-29 Thread Ted Dunning
On Fri, Nov 29, 2013 at 10:16 PM, Amit Nithian anith...@gmail.com wrote:

 Hi Ted,

 Thanks for your response. I thought that the mean of a sparse vector is
 simply the mean of the defined elements? Why would the vectors become
 dense unless you're meaning that all the undefined elements (0?) now will
 be (0-m_x)?


Yes.  Just so.  All those zero elements become non-zero and the vector is
thus no longer sparse.



 Looking at the following example:
 X = [5 - 4] and Y= [4 5 2].

 is m_x 4.5 or 3?


3.

This is because the elements of X are really 5, 0, and 4.  The zero is just
not stored, but it still is the value of that element.


  Is m_y 11/3 or (6/2) because we ignore the 5 since its
  counterpart in X is undefined?


11/3


Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-11-28 Thread Ted Dunning
Yes.  Exactly.
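
Something Adagrad-like keeps a per-feature accumulator of squared gradients
and shrinks the rate as evidence accumulates, rather than decaying on a fixed
schedule.  Roughly like this (an illustrative sketch only, not Mahout code;
eta and epsilon are hypothetical values):

    public class AdagradRate {
      private final double eta = 0.1;             // base learning rate
      private final double epsilon = 1e-8;        // avoids division by zero
      private final double[] sumSquaredGradients; // one accumulator per feature

      public AdagradRate(int numFeatures) {
        sumSquaredGradients = new double[numFeatures];
      }

      // Called once per update with the gradient for this feature.
      public double learningRate(int feature, double gradient) {
        sumSquaredGradients[feature] += gradient * gradient;
        return eta / (epsilon + Math.sqrt(sumSquaredGradients[feature]));
      }
    }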


On Thu, Nov 28, 2013 at 6:32 AM, Vishal Santoshi
vishal.santo...@gmail.com wrote:

 Absolutely. I will read through.  The idea is to first  fix the learning
 rate update equation in OLR.
 I think this code  in  OnlineLogisticRegression is the current equation ?

  @Override
  public double currentLearningRate() {
    return mu0 * Math.pow(decayFactor, getStep())
        * Math.pow(getStep() + stepOffset, forgettingExponent);
  }


  I presume that you would like an Adagrad-like solution to replace the above?






 On Wed, Nov 27, 2013 at 8:18 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi 
  vishal.santo...@gmail.com
 
  
  
   Are we to assume that SGD is still a work in progress and
  implementations (
   Cross Fold, Online, Adaptive ) are too flawed to be realistically used
 ?
  
 
  They are too raw to be accepted uncritically, for sure.  They have been
  used successfully in production.
 
 
   The evolutionary algorithm seems to be the core of
   OnlineLogisticRegression,
   which in turn builds up to Adaptive/Cross Fold.
  
   b) for truly on-line learning where no repeated passes through the
  data..
  
   What would it take to get to an implementation ? How can any one help ?
  
 
  Would you like to help on this?  The amount of work required to get a
  distributed asynchronous learner up is moderate, but definitely not huge.
 
  I think that OnlineLogisticRegression is basically sound, but should get
 a
  better learning rate update equation.  That would largely make the
   Adaptive* stuff unnecessary, especially if OLR could be used in the
  distributed asynchronous learner.
 



Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-11-27 Thread Ted Dunning
No problem at all.  Kind of funny.



On Wed, Nov 27, 2013 at 7:08 AM, Vishal Santoshi
vishal.santo...@gmail.com wrote:

  Sorry to spam, I never meant the "Hello" to come out as "Hell". Given a
  little disappointment in the mail, I figured I'd rather spam than be
  misunderstood.



 On Wed, Nov 27, 2013 at 10:07 AM, Vishal Santoshi 
 vishal.santo...@gmail.com
  wrote:

  Hell Ted,
 
  Are we to assume that SGD is still a work in progress and implementations
  ( Cross Fold, Online, Adaptive ) are too flawed to be realistically used
 ?
  The evolutionary algorithm seems to be the core of
 OnlineLogisticRegression,
  which in turn builds up to Adaptive/Cross Fold.
 
  b) for truly on-line learning where no repeated passes through the
  data..
 
  What would it take to get to an implementation ? How can any one help ?
 
  Regards,
 
 
 
 
 
  On Wed, Nov 27, 2013 at 2:26 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 
  Well, first off, let me say that I am much less of a fan now of the
  magical
  cross validation approach and adaptation based on that than I was when I
  wrote the ALR code.  There are definitely legs in the ideas, but my
  implementation has a number of flaws.
 
  For example:
 
  a) the way that I provide for handling multiple passes through the data
 is
  very easy to screw up.  I think that simply separating the data entirely
  might be a better approach.
 
  b) for truly on-line learning where no repeated passes through the data
  will ever occur, then cross validation is not the best choice.  Much
  better
  in those cases to use what Google researchers described in [1].
 
  c) it is clear from several reports that the evolutionary algorithm
  prematurely shuts down the learning rate.  I think that Adagrad-like
  learning rates are more reliable.  See [1] again for one of the more
  readable descriptions of this.  See also [2] for another view on
 adaptive
  learning rates.
 
  d) item (c) is also related to the way that learning rates are adapted
 in
  the underlying OnlineLogisticRegression.  That needs to be fixed.
 
  e) asynchronous parallel stochastic gradient descent with mini-batch
  learning is where we should be headed.  I do not have time to write it,
  however.
 
  All this aside, I am happy to help in any way that I can given my recent
  time limits.
 
 
  [1] http://research.google.com/pubs/pub41159.html
 
  [2] http://www.cs.jhu.edu/~mdredze/publications/cw_nips_08.pdf
 
 
 
  On Tue, Nov 26, 2013 at 12:54 PM, optimusfan optimus...@yahoo.com
  wrote:
 
   Hi-
  
   We're currently working on a binary classifier using
   Mahout's AdaptiveLogisticRegression class.  We're trying to determine
   whether or not the models are suffering from high bias or variance and
  were
   wondering how to do this using Mahout's APIs?  I can easily calculate
  the
   cross validation error and I think I could detect high bias or
 variance
  if
   I could compare that number to my training error, but I'm not sure how
  to
   do this.  Or, any other ideas would be appreciated!
  
   Thanks,
   Ian
 
 
 



Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-11-27 Thread Ted Dunning
On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi vishal.santo...@gmail.com



 Are we to assume that SGD is still a work in progress and implementations (
 Cross Fold, Online, Adaptive ) are too flawed to be realistically used ?


They are too raw to be accepted uncritically, for sure.  They have been
used successfully in production.


 The evolutionary algorithm seems to be the core of
 OnlineLogisticRegression,
 which in turn builds up to Adaptive/Cross Fold.

 b) for truly on-line learning where no repeated passes through the data..

 What would it take to get to an implementation ? How can any one help ?


Would you like to help on this?  The amount of work required to get a
distributed asynchronous learner up is moderate, but definitely not huge.

I think that OnlineLogisticRegression is basically sound, but should get a
better learning rate update equation.  That would largely make the
Adaptive* stuff unnecessary, especially if OLR could be used in the
distributed asynchronous learner.


Re: Good centroid generation algorithm for top-down clustering approach

2013-11-26 Thread Ted Dunning
Have you looked at the streaming k-means work?  The basic idea is that you
generate a sketch of the data which you can then cluster in-memory.  That
lets you use very advanced centroid generation algorithms that require lots
of processing.




On Tue, Nov 26, 2013 at 6:29 AM, Chih-Hsien Wu chjaso...@gmail.com wrote:

 Hi all, I'm trying to clustering text documents via top-down approach. I
 have experienced both random seed and canopy generation, and have seen
 their pros and cons. I realize that canopy is great for not known exact
 cluster numbers; nevertheless, the memory need for canopy is great. I was
 hoping to find something similar to canopy generation and was wondering if
 there is any other recommendation?



Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-11-26 Thread Ted Dunning
Well, first off, let me say that I am much less of a fan now of the magical
cross validation approach and adaptation based on that than I was when I
wrote the ALR code.  There are definitely legs in the ideas, but my
implementation has a number of flaws.

For example:

a) the way that I provide for handling multiple passes through the data is
very easy to screw up.  I think that simply separating the data entirely
might be a better approach.

b) for truly on-line learning where no repeated passes through the data
will ever occur, then cross validation is not the best choice.  Much better
in those cases to use what Google researchers described in [1].

c) it is clear from several reports that the evolutionary algorithm
prematurely shuts down the learning rate.  I think that Adagrad-like
learning rates are more reliable.  See [1] again for one of the more
readable descriptions of this.  See also [2] for another view on adaptive
learning rates.

d) item (c) is also related to the way that learning rates are adapted in
the underlying OnlineLogisticRegression.  That needs to be fixed.

e) asynchronous parallel stochastic gradient descent with mini-batch
learning is where we should be headed.  I do not have time to write it,
however.

All this aside, I am happy to help in any way that I can given my recent
time limits.


[1] http://research.google.com/pubs/pub41159.html

[2] http://www.cs.jhu.edu/~mdredze/publications/cw_nips_08.pdf



On Tue, Nov 26, 2013 at 12:54 PM, optimusfan optimus...@yahoo.com wrote:

 Hi-

 We're currently working on a binary classifier using
 Mahout's AdaptiveLogisticRegression class.  We're trying to determine
 whether or not the models are suffering from high bias or variance and were
 wondering how to do this using Mahout's APIs?  I can easily calculate the
 cross validation error and I think I could detect high bias or variance if
 I could compare that number to my training error, but I'm not sure how to
 do this.  Or, any other ideas would be appreciated!

 Thanks,
 Ian


Re: Algorithms in Mahout

2013-11-25 Thread Ted Dunning
On Mon, Nov 25, 2013 at 3:14 AM, Manuel Blechschmidt 
manuel.blechschm...@gmx.de wrote:

 There are/were multiple kNN implementations in Mahout:
 Recommender knn
 http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.6/org/apache/mahout/cf/taste/impl/recommender/knn/Optimizer.java
 (will be removed for 0.9)
 stream knn
 https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/StreamingKMeans.java
 normal knn


Streaming k-means isn't strictly a knn implementation.  It is a k-means
clustering application.


Re: OnlineLogisticRegression: Are my settings sensible

2013-11-08 Thread Ted Dunning
You are correct that it should work with smaller data as well, but the
trade-offs are going to be very different.

In particular, some algorithms are completely infeasible at large scale,
but are very effective at small scale.  Some like those used in glmnet
inherently require multiple passes through the data.

The Mahout committers have generally elected to spend time on larger scale
problems, especially where really good small-scale solutions already exist.

That could change if somebody wanted to come in and support some set of
algorithms (hint, hint).




On Fri, Nov 8, 2013 at 3:15 AM, Andreas Bauer b...@gmx.net wrote:

 Ok,  I'll have a look. Thanks! I know mahout is intended for large scale
 machine learning,  but I guess it shouldn't have problems with such small
 data either.



 Ted Dunning ted.dunn...@gmail.com schrieb:
 On Thu, Nov 7, 2013 at 9:45 PM, Andreas Bauer b...@gmx.net wrote:
 
  Hi,
 
  Thanks for your comments.
 
  I modified the examples from the mahout in action book,  therefore I
 used
  the hashed approach and that's why i used 100 features. I'll adjust
 the
  number.
 
 
 Makes sense.  But the book was doing sparse features.
 
 
 
  You say that I'm using the same CVE for all features,  so you mean i
  should create 12 separate CVE for adding features to the vector like
 this?
 
 
 Yes.  Otherwise you don't get different hashes.  With a CVE, the
 hashing
 pattern is generated from the name of the variable.  For a word
 encoder,
 the hashing pattern is generated by the name of the variable (specified
 at
 construction of the encoder) and the word itself (specified at encode
 time).  Text is just repeated words except that the weights aren't
 necessarily linear in the number of times a word appears.
 
 In your case, you could have used a goofy trick with a word encoder
 where
 the word is the variable name and the value of the variable is passed
 as
 the weight of the word.
 
 But all of this hashing is really just extra work for you.  Easier to
 just
 pack your data into a dense vector.
 
 
  Finally, I thought online logistic regression meant that it is an
 online
  algorithm so it's fine to train only once. Does it mean, should i
 invoke
  the train method over and over again with the same training sample
 until
  the next one arrives or how should i make the model converge (or at
 least
  try to with the few samples) ?
 
 
 What online really implies is that training data is measured in terms
 of
 number of input records instead of in terms of passes through the data.
  To
 converge, you have to see enough data.  If that means you need to pass
 through the data several times to fool the learner ... well, it means
 you
 have to pass through the data several times.
 
 Some online learners are exact in that they always have the exact
 result at
 hand for all the data they have seen.  Welford's algorithm for
 computing
 sample mean and variance is like that. Others approximate an answer.
 Most
 systems which are estimating some property of a distribution are
 necessarily approximate.  In fact, even Welford's method for means is
 really only approximating the mean of the distribution based on what it
 has
 seen so far.  It happens that it gives you the best possible estimate
 so
 far, but that is just because computing a mean is simple enough.  With
 regularized logistic regression, the estimation is trickier and you can
 only say that the algorithm will converge to the correct result
 eventually
 rather than say that the answer is always as good as it can be.
 
 Another way to say it is that the key property of on-line learning is
 that
 the learning takes a fixed amount of time and no additional memory for
 each
 input example.
 
 
  What would you suggest to use for incremental training instead of
 OLR?  Is
  mahout perhaps the wrong library?
 
 
 Well, for thousands of examples, anything at all will work quite well,
 even
 R.  Just keep all the data around and fit the data whenever requested.
 
 Take a look at glmnet for a very nicely done in-memory L1/L2
 regularized
 learner.  A quick experiment indicates that it will handle 200K samples
 of
 the sort you are looking in about a second with multiple levels of
 lambda
 thrown into the bargain.  Versions available in R, Matlab and Fortran
 (at
 least).
 
 http://www-stat.stanford.edu/~tibs/glmnet-matlab/
 
 This kind of in-memory, single machine problem is just not what Mahout
 is
 intended to solve.




Re: Solr-recommender for Mahout 0.9

2013-11-08 Thread Ted Dunning
For recommendation work, I suggest that it would be better to simply code
out an explicit OR query.

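For example, with a hypothetical indicator field and item history, the query
string can be assembled directly (no more-like-this handler involved):

    import java.util.Arrays;
    import java.util.List;

    public class OrQueryExample {
      public static void main(String[] args) {
        List<String> history = Arrays.asList("item123", "item456", "item789"); // made-up item ids
        StringBuilder q = new StringBuilder("indicators:(");                   // made-up field name
        for (int i = 0; i < history.size(); i++) {
          if (i > 0) {
            q.append(" OR ");
          }
          q.append(history.get(i));
        }
        q.append(")");
        System.out.println(q);   // indicators:(item123 OR item456 OR item789)
      }
    }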



On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler kkrugler_li...@transpac.comwrote:

 Hi Pat,

 On Nov 7, 2013, at 7:30pm, Pat Ferrel pat.fer...@gmail.com wrote:

   Another approach would be to weight the terms in the docs by their
 Mahout similarity strength. But that will be for another day.
 
  My current question is whether Lucene looks at word proximity. I see the
 query syntax supports proximity but I don’t see that it is default so
 that’s good.

 Based on your description of what you do (generate an OR query of N terms)
 then no, you shouldn't be getting a boost from proximity.

 Note that with edismax you can specify a phrase boost, but it will be on
 the entire set of terms being searched, so unlikely to come into play even
 if you were using that.

 -- Ken


 
 
  On Nov 7, 2013, at 12:41 PM, Dyer, James james.d...@ingramcontent.com
 wrote:
 
  Best to my knowledge, Lucene does not care about the position of a
 keyword within a document.
 
  You could bucket the ids into several fields.  Then use a dismax query
 to boost the top-tier ids more than then second, etc.
 
  A more fine-grained approach would probably involve a custom Similarity
 class that scales the score based on its position in the document.  If we
 did this, it might be simpler to index as 1 single-valued field so each id
 was position+1 rather than position+100, etc.
 
  James Dyer
  Ingram Content Group
  (615) 213-4311
 
 
  -Original Message-
  From: Pat Ferrel [mailto:pat.fer...@gmail.com]
  Sent: Thursday, November 07, 2013 1:46 PM
  To: user@mahout.apache.org
  Subject: Re: Solr-recommender for Mahout 0.9
 
  Interesting to think about ordering and adjacentness. The index ids are
 sorted by Mahout strength so the first id is the most similar to the row
  key and so forth. But the query is ordered by recency. In both cases the
 first id is in some sense the most important. Does Solr/Lucene care about
 closeness to the top of doc for queries or indexed docs? I don't recall any
 mention of this.
 
  However adjacentness has no meaning in recommendations though I think
 it's used in default queries so I may have to account for that.
 
  The object returned is an ordered list of ids. I use only the IDs now
 but there are cases when the contents are also of interest; shopping
 cart/watchlist queries for example.
 
  On Nov 7, 2013, at 10:00 AM, Dyer, James james.d...@ingramcontent.com
 wrote:
 
  The multivalued field will obey the positionIncrementGap value you
 specify (default=100).  So for querying purposes, those id's will be 100
 (or whatever you specified) positions apart.  So a phrase search for
  adjacent ids would not match, unless you set the slop to >=
  positionIncrementGap.  Other than this, both scenarios index the same.
 
  For stored fields, solr returns an array of values for multivalued
  fields, which is convenient when writing a UI.
 
  James Dyer
  Ingram Content Group
  (615) 213-4311
 
 
  -Original Message-
  From: Dominik Hübner [mailto:cont...@dhuebner.com]
  Sent: Thursday, November 07, 2013 11:23 AM
  To: user@mahout.apache.org
  Subject: Re: Solr-recommender for Mahout 0.9
 
  Does anyone know what the difference is between keeping the ids in a
 space delimited string and indexing a multivalued field of ids? I recently
 tried the latter since ... it felt right, however I am not sure which of
 both has which advantages.
 
  On 07 Nov 2013, at 18:18, Pat Ferrel pat.fer...@gmail.com wrote:
 
  I have dismax (no edismax) but am not using it yet, using the default
 query, which does use 'AND'. I had much the same though as I slept on it.
 Changing to OR is now working much much better. So obvious it almost bit
 me, not good in this case...
 
  With only a trivially small amount of testing I'd say we have a new
 recommender on the block.
 
  If anyone would like to help eyeball test the thing let me know
 off-list. There are a few instructions I'll need to give. And it can't
 handle much load right now due to intentional design limits.
 
 
  On Nov 7, 2013, at 6:11 AM, Dyer, James james.d...@ingramcontent.com
 wrote:
 
  Pat,
 
  Can you give us the query it generates when you enter vampire werewolf
 zombie, q/qt/defType ?
 
  My guess is you're using the default query parser with q.op=AND , or,
 you're using dismax/edismax with a high mm (min-must-match) value.
 
  James Dyer
  Ingram Content Group
  (615) 213-4311
 
 
  -Original Message-
  From: Pat Ferrel [mailto:pat.fer...@gmail.com]
  Sent: Wednesday, November 06, 2013 5:53 PM
  To: s...@apache.org Schelter; user@mahout.apache.org
  Subject: Re: Solr-recommender for Mahout 0.9
 
  Done,
 
  BTW I have the thing running on a demo site but am getting very poor
 results that I think are related to the Solr setup. I'd appreciate any
 ideas.
 
  The sample data has 27,000 items and something like 4000 users. The
 preference data is 

Re: Decaying score for old preferences when using the .refresh()

2013-11-07 Thread Ted Dunning
On Thu, Nov 7, 2013 at 12:50 AM, Gokhan Capan gkhn...@gmail.com wrote:

 This particular approach is discussed, and proven to increase the accuracy
 in Collaborative filtering with Temporal Dynamics by Yehuda Koren. The
 decay function is parameterized per user, keeping track of how consistent
 the user behavior is.


Note that user-level temporal dynamics does not actually improve the
accuracy of ranking. It improves the accuracy of ratings.  Since
recommendation quality is primarily a precision@20 sort of activity,
improving ratings does no good at all.

Item-level temporal dynamics is a different beast.


Re: OnlineLogisticRegression: Are my settings sensible

2013-11-07 Thread Ted Dunning
Why is FEATURE_NUMBER != 13?

With 12 features that are already lovely and continuous, just stick them in
elements 1..12 of a 13 long vector and put a constant value at the
beginning of it.  Hashed encoding is good for sparse stuff, but confusing
for your case.

Also, it looks like you only pass through the (very small) training set
once.  The OnlineLogisticRegression is unlikely to converge very well with
such a small number of examples.

Finally, in the hashed representation that you are using, you use exactly
the same CVE to put all 15 (12?) of the variables into the vector.  Since
you are using the same CVE, all of these values will be put into exactly
the same location which is going to kill performance since you will get the
effect of summing all your variables together.
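Concretely, a minimal sketch of the dense-vector suggestion above (the feature
accessor is hypothetical; the Mahout classes are the real ones):

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class DenseOlrTrainer {
      // 13 = 1 constant bias term + 12 continuous features; no hashed encoding needed.
      private final OnlineLogisticRegression olr =
          new OnlineLogisticRegression(2, 13, new L1());

      /** features: the 12 already-normalized values; target: 0 or 1. */
      public void train(double[] features, int target) {
        Vector v = new DenseVector(13);
        v.set(0, 1.0);                       // constant value at the beginning
        for (int i = 0; i < features.length; i++) {
          v.set(i + 1, features[i]);
        }
        olr.train(target, v);
      }
    }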





On Thu, Nov 7, 2013 at 1:48 PM, Andreas Bauer b...@gmx.net wrote:

 Hi,

 I’m trying to use OnlineLogisticRegression for a two-class classification
 problem, but as my classification results are not very good, I wanted to
 ask for support to find out if my settings are correct and if I’m using
 Mahout correctly. Because if I’m doing it correctly then probably my
 features are crap...

 In total I have 12 features. All are continuous values and all are
  normalized/standardized (has no effect on the classification performance
 at the moment).

 Training samples keep flowing in at constant rate (i.e. incremental
 training), but in total it won’t be more than a few thousand (class split
 pos/negative 30:70).

 My performance measure do not really get good, e.g. with approx. 3600
 training samples I get

 f-measure(beta=0.5): 0.38
 precision: 0.33
 recall: 0.47

 The parameters I use are

 lambda=0.0001
 offset=1000
 alpha=1
 decay_exponent=0.9
 learning_rate=50


 FEATURE_NUMBER = 100;
 CATEGORIES_NUMBER = 2;



 Java code snip:

 private OnlineLogisticRegression olr;
 private ContinuousValueEncoder continousValueEncoder;

 private static final FeatureVectorEncoder BIAS =
     new ConstantValueEncoder("Intercept");

 …
 public Training() {
   // L2 or ElasticBandPrior do not affect the performance
   olr = new OnlineLogisticRegression(CATEGORIES_NUMBER, FEATURE_NUMBER, new L1());
   olr.lambda(lambda).learningRate(learning_rate)
      .stepOffset(offset).decayExponent(decay_exponent);
   this.continousValueEncoder = new ContinuousValueEncoder("ContinuousValueEncoder");
   this.continousValueEncoder.setProbes(20);
   ….
 }

 public void train(TrainingSample sample, int target) {
   DenseVector denseVector = new DenseVector(FEATURE_NUMBER);
   // sample.getFeatureValue1() ... getFeatureValue15() each return a double value
   this.continousValueEncoder.addToVector((byte[]) null,
       sample.getFeatureValue1(), denseVector);
   ….
   this.continousValueEncoder.addToVector((byte[]) null,
       sample.getFeatureValue15(), denseVector);
   BIAS.addToVector((byte[]) null, 1, denseVector);
   olr.train(target, denseVector);
 }

  It is also interesting to notice that when I use the model, both test and
  classification always yield probabilities of 1.0 or 0.99xxx for either
 class.

  result = this.olr.classifyFull(input);
  LOGGER.debug("TrainingSink test: classify real category: " + realCategory
      + " olr classifier result: " + result.maxValueIndex()
      + " prob: " + result.maxValue());




 Would be great if you could give me some advise.

 Many thanks,

 Andreas





Re: OnlineLogisticRegression: Are my settings sensible

2013-11-07 Thread Ted Dunning
On Thu, Nov 7, 2013 at 9:45 PM, Andreas Bauer b...@gmx.net wrote:

 Hi,

 Thanks for your comments.

 I modified the examples from the mahout in action book,  therefore I used
 the hashed approach and that's why i used 100 features. I'll adjust the
 number.


Makes sense.  But the book was doing sparse features.



 You say that I'm using the same CVE for all features,  so you mean i
 should create 12 separate CVE for adding features to the vector like this?


Yes.  Otherwise you don't get different hashes.  With a CVE, the hashing
pattern is generated from the name of the variable.  For a word encoder,
the hashing pattern is generated by the name of the variable (specified at
construction of the encoder) and the word itself (specified at encode
time).  Text is just repeated words except that the weights aren't
necessarily linear in the number of times a word appears.

In your case, you could have used a goofy trick with a word encoder where
the word is the variable name and the value of the variable is passed as
the weight of the word.
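Roughly, that trick would look like this (feature names and values are made
up; the dense-vector route below is still the simpler choice):

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class WordEncoderTrick {
      public static void main(String[] args) {
        StaticWordValueEncoder encoder = new StaticWordValueEncoder("features");
        Vector v = new DenseVector(100);

        // The "word" is the variable name; the variable's value rides along as the weight.
        encoder.addToVector("heartRate", 72.0, v);
        encoder.addToVector("temperature", 36.6, v);
      }
    }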

But all of this hashing is really just extra work for you.  Easier to just
pack your data into a dense vector.


 Finally, I thought online logistic regression meant that it is an online
 algorithm so it's fine to train only once. Does it mean, should i invoke
 the train method over and over again with the same training sample until
 the next one arrives or how should i make the model converge (or at least
 try to with the few samples) ?


What online really implies is that training data is measured in terms of
number of input records instead of in terms of passes through the data.  To
converge, you have to see enough data.  If that means you need to pass
through the data several times to fool the learner ... well, it means you
have to pass through the data several times.

Some online learners are exact in that they always have the exact result at
hand for all the data they have seen.  Welford's algorithm for computing
sample mean and variance is like that. Others approximate an answer.  Most
systems which are estimating some property of a distribution are
necessarily approximate.  In fact, even Welford's method for means is
really only approximating the mean of the distribution based on what it has
seen so far.  It happens that it gives you the best possible estimate so
far, but that is just because computing a mean is simple enough.  With
regularized logistic regression, the estimation is trickier and you can
only say that the algorithm will converge to the correct result eventually
rather than say that the answer is always as good as it can be.

Another way to say it is that the key property of on-line learning is that
the learning takes a fixed amount of time and no additional memory for each
input example.
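As a concrete example of that property, here is a sketch of Welford's update:
constant time and constant memory per observation, with the running mean and
variance always available.

    /** Welford's online mean/variance: O(1) time and memory per observation. */
    public class Welford {
      private long n = 0;
      private double mean = 0;
      private double m2 = 0;   // sum of squared deviations from the running mean

      public void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
      }

      public double mean() {
        return mean;
      }

      public double sampleVariance() {
        return n > 1 ? m2 / (n - 1) : 0;
      }
    }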


 What would you suggest to use for incremental training instead of OLR?  Is
 mahout perhaps the wrong library?


Well, for thousands of examples, anything at all will work quite well, even
R.  Just keep all the data around and fit the data whenever requested.

Take a look at glmnet for a very nicely done in-memory L1/L2 regularized
learner.  A quick experiment indicates that it will handle 200K samples of
the sort you are looking in about a second with multiple levels of lambda
thrown into the bargain.  Versions available in R, Matlab and Fortran (at
least).

http://www-stat.stanford.edu/~tibs/glmnet-matlab/

This kind of in-memory, single machine problem is just not what Mahout is
intended to solve.


Re: Scheduled tasks in Mahout

2013-10-30 Thread Ted Dunning
No.  Scheduling is outside of Mahout's scope.




On Wed, Oct 30, 2013 at 12:55 PM, Cassio Melo melo.cas...@gmail.com wrote:

 I wonder if Mahout (more precisely
 org.apache.mahout.cf.taste package) has any helper class to execute
 scheduled tasks like fetch data, compute similarity, etc.

 Thank you

 Cassio



Re: TravellingSaleman

2013-10-29 Thread Ted Dunning
Actually that isn't quite correct.

Watchmaker was removed.  That was a genetic algorithm implementation.

EP or evolutionary programming still has an implementation in Mahout in the
class org.apache.mahout.ep.EvolutionaryProcess

This algorithm is documented here: http://arxiv.org/abs/0803.3838






On Tue, Oct 29, 2013 at 9:33 AM, Suneel Marthi suneel_mar...@yahoo.comwrote:

 EP has been removed as of mahout 0.7

 Sent from my iPhone

  On Oct 29, 2013, at 9:31 AM, Pavan K Narayanan 
 pavan.naraya...@gmail.com wrote:
 
  Hi, is the evolutionary algorithm package is still in active
  development in Mahout? I am interested in running a sample TSP with
  some benchmark data using 0.7. I entered
 
  $ bin/mahout
 org.apache.mahout.ga.watchmaker.travellingsalesman.TravellingSalesman
 
  and got an 'unknown program chosen' error. I was actually hoping it would
  show all the options that we can use with traveling salesman. can
  anyone please give me the correct syntax? it is not even to be found
  in list of valid program names.
 
  Regards,



Re: Mahout 0.8 Random Forest Accuracy

2013-10-19 Thread Ted Dunning
Tim,

Yes, RF's are ensemble learners, but that doesn't mean that you couldn't
wrap them up with other classifiers to have a higher level ensemble.


On Sat, Oct 19, 2013 at 6:48 AM, Tim Peut t...@timpeut.com wrote:

 Thanks for the info and suggestions everyone.

 On 19 October 2013 01:00, Ted Dunning ted.dunn...@gmail.com wrote:

 On Fri, Oct 18, 2013 at 3:50 PM, j.barrett Strausser 
 j.barrett.straus...@gmail.com wrote:

  How difficult would it be to wrap the RF classifier into an ensemble
  learner?
 
 It is callable.  Should be relatively easy.

 I'm still becoming familiar with machine learning terminology so please
 forgive my ignorance. I thought that random forests are, by nature,
 ensemble learners? What exactly do you mean by this?



Re: Mahout 0.8 Random Forest Accuracy

2013-10-18 Thread Ted Dunning
On Fri, Oct 18, 2013 at 7:48 AM, Tim Peut t...@timpeut.com wrote:

 Has anyone found that Mahout's random forest doesn't perform as well as
 other implementations? If not, is there any reason why it wouldn't perform
 as well?


This is disappointing, but not entirely surprising.  There has been
considerably less effort applied to Mahout's random forest package than the
comparable R packages.

Note, particularly that the Mahout implementation is not regularized.  That
could well be a big difference.


Re: Mahout 0.8 Random Forest Accuracy

2013-10-18 Thread Ted Dunning
On Fri, Oct 18, 2013 at 3:50 PM, j.barrett Strausser 
j.barrett.straus...@gmail.com wrote:

 How difficult would it be to wrap the RF classifier into an ensemble
 learner?


It is callable.  Should be relatively easy.


Re: Clustering of text data on external categories

2013-10-11 Thread Ted Dunning
Search engines do cool things.


On Fri, Oct 11, 2013 at 7:42 AM, Jens Bonerz jbon...@googlemail.com wrote:

 what a nice idea :-) really like that approach


 2013/10/11 Ted Dunning ted.dunn...@gmail.com

  You don't need Mahout for this.
 
  A very easy way to do this is to gather all the words for each category
  into a document.  Thus:
 
  CatA:selling buying sales payment
  CatB:gathering collecting
  CatC:information data info
 
  Then put these into a text retrieval engine so that you have one document
  per category.
 
  When you get a new document to categorize, just use the document as a
 query
  and you will get a list of possible categories back.  Make sure you set
 the
  default query mode to OR for this.
 
  See http://wiki.apache.org/solr/SolrQuerySyntax for more on the syntax.
 
 
 
  On Fri, Oct 11, 2013 at 5:04 AM, Kasi Subrahmanyam
  kasisubbu...@gmail.comwrote:
 
   Hi,
  
   I have a problem that i would like to implement in mahout clustering.
  
   I have input text documents with data like below.
  
   Document1: This is the first document of selling information.
   Document2: This is the second document of gathering information.
  
   I also have another look up file with data like below
   selling:CatA
   gathering:CatB.
   information:CatC
  
   NOw i would like to cluster the documents with output being genrated as
   Document1:CatA,CatC
   Document2:CatB,CatC
  
   Please let me know how to achieve this.
  
   Thanks,
   Subbu
  
 



Re: Naive bayes and character n-grams

2013-10-10 Thread Ted Dunning
For language detection, you are going to have a hard time doing better than
one of the standard packages for the purpose.  See here:

http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html


On Thu, Oct 10, 2013 at 1:01 AM, Dean Jones dean.m.jo...@gmail.com wrote:

 Hi Si,

 On 10 October 2013 07:59, simon.2.thomp...@bt.com wrote:
 
  what do you mean by character n-grams? If you mean things like ab or
 ui2 then given that there are so few characters compared to words is
 there a problem that can't be solved without a look-up table for ny (where
 y 4ish )
 
  Or are you looking at y 4 ish because if so then do you run into the
 issue of a sudden space explosion?
 

 Yes, just tokens in a text broken up into sequences of their constituent
 characters. In my initial tests, language detection works well where n=3,
 particularly when including the head and tail bigrams. So I need something
 to generate the required sequence files from my training data.



Re: Naive bayes and character n-grams

2013-10-10 Thread Ted Dunning
Cool. Sounds like you are ahead of the game.  

Sent from my iPhone

On Oct 10, 2013, at 13:15, Dean Jones dean.m.jo...@gmail.com wrote:

 On 10 October 2013 12:46, Ted Dunning ted.dunn...@gmail.com wrote:
 For language detection, you are going to have a hard time doing better than
 one of the standard packages for the purpose.  See here:
 
 http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
 
 Thanks for the pointer Ted. I'm a big fan of the Tika project, we use
 it for content extraction already. For various reasons though, we have
 rolled our own language detector (mainly, neither of these packages
 cover all of the languages we need to identify - language-detection
 doesn't do Catalan, Tika doesn't do Welsh).
 
 Dean.


Re: Naive bayes and character n-grams

2013-10-09 Thread Ted Dunning
Yes.  Should work to use character n-grams.  There are oddities in the
stats because the different n-grams are not independent, but Naive Bayes
methods are in such a state of sin that it shouldn't hurt any worse.

No... I don't think that there is a capability built in to generate the
character n-grams.  Should be relatively trivial to build.
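For example, a trivial generator of padded character n-grams (roughly the
head/tail-aware trigrams described above; the padding character is arbitrary):

    import java.util.ArrayList;
    import java.util.List;

    public class CharNgrams {
      /** Character n-grams of a token, padded so head and tail grams are marked. */
      public static List<String> ngrams(String token, int n) {
        String padded = "_" + token + "_";
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= padded.length(); i++) {
          grams.add(padded.substring(i, i + n));
        }
        return grams;
      }

      public static void main(String[] args) {
        System.out.println(ngrams("croeso", 3));   // [_cr, cro, roe, oes, eso, so_]
      }
    }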



On Wed, Oct 9, 2013 at 3:18 AM, Dean Jones dean.m.jo...@gmail.com wrote:

 Hello folks,

 I see that it's possible to use mahout to train a naive bayes
 classifier using n-grams as features (or I guess, strictly speaking,
 mahout can be used to generate sequence files containing n-grams; I
 suspect the naive bayes trainer is indifferent to the form of features
 it trains on). Is there any facility to generate character n-grams
 instead of word n-grams?

 Thanks,

 Dean.



Re: Solr-recommender

2013-10-09 Thread Ted Dunning
Mike,

Thanks for the vote of confidence!


On Wed, Oct 9, 2013 at 6:13 AM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 Just to add a note of encouragement for the idea of better integration
 between Mahout and Solr:

 On safariflow.com, we've recently converted our recommender, which
 computes similarity scores w/Mahout, from storing scores and running
 queries w/Postgres, to doing all that in Solr.  It's been a big
 improvement, both in terms of indexing speed, and more importantly, the
 flexibility of the queries we can write.  I believe that having scoring
 built in to the query engine is a key feature for recommendations.  More
 and more I am coming to believe that recommendation should just be
 considered as another facet of search: as one among many variables the
 system may take into account when presenting relevant information to the
 user.  In our system, we still clearly separate search from
 recommendations, and we probably will always do that to some extent, but I
 think we will start to blend the queries more so that there will be
 essentially a continuum of query options including more or less user
 preference data.

 I think what I'm talking about may be a bit different than what Pat is
 describing (in implementation terms), since we do LLR calculations off-line
 in Mahout and then bulk load them into Solr.  We took one of Ted's earlier
 suggestions to heart, and simply ignored the actual numeric scores: we
 index the top N similar items for each item.  Later we may incorporate
 numeric scores in Solr as term weights.  If people are looking for things
 to do :) I think that would be a great software contribution that could
 spur this effort onward since it's difficult to accomplish right now given
 the Solr/Lucene indexing interfaces, but is already supported by the
 underlying data model and query engine.


 -Mike


 On 10/2/13 12:19 PM, Pat Ferrel wrote:

 Excellent. From Ellen's description the first Music use may be an
 implicit preference based recommender using synthetic  data? I'm quickly
 discovering how flexible Solr use is in many of these cases.

 Here's another use you may have thought of:

 Shopping cart recommenders, as goes the intuition, are best modeled as
 recommending from similar item-sets. If you store all shopping carts as
 your training data (play lists, watch lists etc.) then as a user adds
 things to their cart you query for the most similar past carts. Combine the
 results intelligently and you'll have an item set recommender. Solr is
  built to do this item-set similarity. We tried to do this for an ecom site
 with pure Mahout but the similarity calc in real time stymied us. We knew
 we'd need Solr but couldn't devote the resources to spin it up.

 On the Con-side Solr has a lot of stuff you have to work around. It also
 does not have the ideal similarity measure for many uses (cosine is ok but
 llr would probably be better). You don't want stop word filtering,
 stemming, white space based tokenizing or n-grams. You would like explicit
 weighting. A good thing about Solr is how well it integrates with virtually
 any doc store independent of the indexing and query. A bit of an oval peg
 for a round hole.

 It looks like the similarity code is replaceable if not pluggable. Much
 of the rest could be trimmed away by config or adherence to conventions I
 suspect. In the demo site I'm working on I've had to adopt some slightly
 hacky conventions that I'll describe some day.

 On Oct 1, 2013, at 10:38 PM, Ted Dunning ted.dunn...@gmail.com wrote:


 Pat,

 Ellen and some folks in Britain have been working with some data I
 produced from synthetic music fans.


 On Tue, Oct 1, 2013 at 2:22 PM, Pat Ferrel p...@occamsmachete.com wrote:
 Hi Ellen,


 On Oct 1, 2013, at 12:38 PM, Ted Dunning ted.dunn...@gmail.com wrote:


 As requested,

 Pat, meet Ellen.

 Ellen, meet Pat.




 On Tue, Oct 1, 2013 at 8:46 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 Tunneling (rat-holing?) into the cross-recommender and Solr+Mahout
 version.

 Things to note:
 1) The pure Mahout XRecommenderJob needs a cross-LLR or a
  cross-similarity job. Currently there is only cooccurrence for
 sparsification, which is far from optimal. This might take the form of a
 cross RSJ with two DRMs as input. I can't commit to this but would commit
 to adding it to the XRecommenderJob.
 2) output to Solr needs a lot of options implemented and tested. The
 hand-run test should be made into some junits. I'm slowly doing this.
 3) the Solr query API is unimplemented unless someone else is working on
 that. I'm building one in a demo site but it looks to me like a static
 recommender API is not going to be all that useful and maybe a document
 describing how to do it with the Solr query interface would be best,
 especially for a first step. The reasoning here is that it is so tempting
 to mix in metadata to the recommendation query that a static API is not so
 obvious. For the demo site the recommender API

Re: Solr-recommender

2013-10-09 Thread Ted Dunning
On Wed, Oct 9, 2013 at 12:54 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 On 10/9/13 3:08 PM, Pat Ferrel wrote:

  Solr uses cosine similarity for its queries. The implementation on
 github uses Mahout LLR for calculating the item-item similarity matrix but
 when you do the more-like-this query at runtime Solr uses cosine. This can
 be fixed in Solr, not sure how much work.

 It's not clear to me whether it's worth fixing this or not.  It would
 certainly complicate scoring calculations when mixing with traditional
 search terms.


I am pretty convinced it is not worth fixing.

This is particularly true because when you fix one count at 1 and take the
limiting form of LLR, you get something quite similar to LLR in any case.
 This means that Solr's current query is very close to what we want
theoretically ... certainly at least as close as theory is to practice.


Re: Solr-recommender

2013-10-09 Thread Ted Dunning
On Wed, Oct 9, 2013 at 2:07 PM, Pat Ferrel p...@occamsmachete.com wrote:

 2) What you are doing is something else that I was calling a shopping-cart
 recommender. You are using the item-set in the current cart and finding
 similar, what, items? A different way to tackle this is to store all other
 shopping carts then use the current cart contents as a more-like-this query
 against past carts. This will give you items-purchased-together by other
 users. If you have enough carts it might give even better results. In any
 case they will be different.



Or the shopping cart can be used as a query for the current indicator
fields.  That gives you an item-based recommendation from shopping cart
contents.

I am not sure that the more-like-this query buys all that much versus an
ordinary query on the indicator fields.


Re: Solr-recommender

2013-10-09 Thread Ted Dunning
On Wed, Oct 9, 2013 at 12:54 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 It sounds like you are doing item-item similarities for recommendations,
 not actually calculating user-history based recs, is that true?

 Yes that's true so far.  Our recommender system has the ability to provide
 recs based on user history, but we have not deployed this in our app yet.
  My plan was simply to query based on all the items in the user's basket
 - not sure that this would require a different back end?  We're not at the
 moment considering user-user similarity measures.


The items in the basket really are kind of a history (a history of the
items placed in the basket).

It is quite reasonable to use those as a query against indicator fields.

It would be nice to generate indicators (aka binarized item-item LLR
similarities) from a number of different actions such as view, dwell,
scroll, add-to-basket and see which ones or which combos give you the best
recommendation.


Re: What are the best settings for my clustering task

2013-10-06 Thread Ted Dunning
It is there, at the very least as part of the streaming k-means code.  The 
abbreviation bkm has been used in the past.  

In looking at the code just now I don't find any command line invocation of 
bkm. It should be quite simple to write one and it would be very handy to have 
a way to run streaming k-means without a map reduce step as well. As such it 
might be good to have a new option to streaming k-means to use just bkm in a 
single thread, to use threaded streaming k-means on a single machine or to use 
map-reduce streaming k-means.  

You up for trying to make a patch?

Sent from my iPhone

On Oct 6, 2013, at 12:37, Jens Bonerz jbon...@googlemail.com wrote:

 Hmmm.. has ballkmeans made it already into the 0.8 release? can't find it
 in the list of available programs when calling the mahout binary...
 
 
 2013/10/3 Ted Dunning ted.dunn...@gmail.com
 
 What you are seeing here are the cluster centroids themselves, not the
 cluster assignments.
 
 Streaming k-means is a single pass algorithm to derive these centroids.
 Typically, the next step is to cluster these centroids using ball k-means.
 *Those* results can then be applied back to the original (or new) input
 vectors to get cluster assignments for individual input vectors.
 
 I don't have command line specifics handy, but you seem to have done very
 well already at figuring out the details.
 
 
 On Oct 3, 2013, at 7:30 AM, Jens Bonerz wrote:
 
  I created a series of scripts to try out streamingkmeans in mahout and
 increased the number of clusters to a high amount as suggested by Ted.
 Everything seems to work. However, I can't figure out how to access the
 actual cluster data at the end of the process.
 
 It just gives me output that I cannot really understand... I would expect
 my product_ids being referenced to cluster ids...
 
 Example of the procedure's output:
 
 hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running
 locally
 Input Path: file:MahoutCluster/part-r-0
 Key class: class org.apache.hadoop.io.IntWritable Value Class: class
 org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
 Key: 0: Value: key = 8678, weight = 3.00, vector =
 
 {37:26.83479118347168,6085:8.162049293518066,4785:10.3130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307...
 Key: 1: Value: key = 3118, weight = 14.00, vector =
 
 {19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958...
 Key: 2: Value: key = 2867, weight = 3.00, vector =
 
 {15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.61132911475,7950:8.364990234375,8018:8.68657398223877,15433:7.959...
 Key: 3: Value: key = 6295, weight = 1.00, vector =
 
 {17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1...
 Key: 4: Value: key = 6725, weight = 4.00, vector =
 
 {10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143...
 Key: 5:..
 
 
 
 this is my recipe:
 
 
 
 Step 1
  Create a seqfile from my data with Python. It's the product_id (key) and
  the
  short normalized description (value) that is written into the sequence
  file.
 
 
 
 
 
 Step 2
 create vectors from that data with the following command:
 
 mahout seq2sparse \
  -i productClusterSequenceData/productClusterSequenceData.seq \
  -o productClusterSequenceData/vectors \
 
 
 
 
 
 Step 3
  Cluster the vectors using streamingkmeans with this command:
 
 mahout streamingkmeans \
 -i productClusterSequenceData/vectors/tfidf-vectors \
 -o MahoutCluster \
 --tempDir /tmp \
 -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
 -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
 -k 1 -km 50 \
 
 
 
 
 
 Step 4
 Export the streamingkmeans cluster data into a textfile (for an extract
 of
 the result see above)
 
 mahout seqdumper \
  -i MahoutCluster > similarProducts.txt
 
 What am I missing?
 
 
 
 
 
 2013/10/3 Ted Dunning ted.dunn...@gmail.com
 
 Yes.  That will work.
 
 The sketch will then contain 10,000 x log N centroids.  If N = 10^9,
 log N
  \approx 30 so the sketch will have about 300,000 weighted centroids
 in
 it.  The final clustering will have to process these centroids to
 produce
 the desired 5,000 clusters.  Since 300,000 is a relatively small number
 of
 data points, this clustering step should proceed

Re: Editing Dictionary Vector Generated

2013-10-04 Thread Ted Dunning
Why do you say that this is unacceptable?

If the phrase is the most common way that the word English is used, this isn't 
such a bad thing.  

In general, with machine learning, the idea is to let the data speak. If the 
data say something you don't like, you have to be careful about contradicting 
it. 

That said, you might be happier with something other than naive bayes 
classifiers (which I am guessing you are using). For instance, with regularized 
logistic regression, if the bigram is sufficiently predictive then the model 
will prefer to put zero weight on the constituent unigrams.  

Sent from my iPhone

On Oct 4, 2013, at 9:50, Puneet Arora arorapuneet2...@gmail.com wrote:

  'anti' is marked as negative, which is also acceptable, but
  it is also taking 'English' as negative, which is not acceptable


Re: What are the best settings for my clustering task

2013-10-04 Thread Ted Dunning
What you are seeing here are the cluster centroids themselves, not the cluster 
assignments.

Streaming k-means is a single pass algorithm to derive these centroids.  
Typically, the next step is to cluster these centroids using ball k-means.  
*Those* results can then be applied back to the original (or new) input vectors 
to get cluster assignments for individual input vectors.

I don't have command line specifics handy, but you seem to have done very well 
already at figuring out the details.
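That assignment step is just a nearest-centroid lookup. A minimal sketch using
Mahout's distance measures (the centroid list would be read from the ball
k-means output):

    import java.util.List;

    import org.apache.mahout.common.distance.CosineDistanceMeasure;
    import org.apache.mahout.common.distance.DistanceMeasure;
    import org.apache.mahout.math.Vector;

    public class NearestCentroid {
      /** Returns the index of the closest centroid for one input vector. */
      public static int assign(Vector input, List<Vector> centroids) {
        DistanceMeasure distance = new CosineDistanceMeasure();
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
          double d = distance.distance(centroids.get(i), input);
          if (d < bestDistance) {
            bestDistance = d;
            best = i;
          }
        }
        return best;
      }
    }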


On Oct 3, 2013, at 7:30 AM, Jens Bonerz wrote:

  I created a series of scripts to try out streamingkmeans in mahout and
 increased the number of clusters to a high amount as suggested by Ted.
 Everything seems to work. However, I can't figure out how to access the
 actual cluster data at the end of the process.
 
 It just gives me output that I cannot really understand... I would expect
 my product_ids being referenced to cluster ids...
 
 Example of the procedure's output:
 
 hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running
 locally
 Input Path: file:MahoutCluster/part-r-0
 Key class: class org.apache.hadoop.io.IntWritable Value Class: class
 org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
 Key: 0: Value: key = 8678, weight = 3.00, vector =
 {37:26.83479118347168,6085:8.162049293518066,4785:10.3130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307...
 Key: 1: Value: key = 3118, weight = 14.00, vector =
 {19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958...
 Key: 2: Value: key = 2867, weight = 3.00, vector =
 {15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.61132911475,7950:8.364990234375,8018:8.68657398223877,15433:7.959...
 Key: 3: Value: key = 6295, weight = 1.00, vector =
 {17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1...
 Key: 4: Value: key = 6725, weight = 4.00, vector =
 {10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143...
 Key: 5:..
 
 
 
 this is my recipe:
 
 
 Step 1
  Create a seqfile from my data with Python. It's the product_id (key) and the
  short normalized description (value) that is written into the sequence file.
 
 
 
 
 Step 2
 create vectors from that data with the following command:
 
 mahout seq2sparse \
   -i productClusterSequenceData/productClusterSequenceData.seq \
   -o productClusterSequenceData/vectors \
 
 
 
 
 Step 3
  Cluster the vectors using streamingkmeans with this command:
 
 mahout streamingkmeans \
 -i productClusterSequenceData/vectors/tfidf-vectors \
 -o MahoutCluster \
 --tempDir /tmp \
 -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
 -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
 -k 1 -km 50 \
 
 
 
 
 Step 4
 Export the streamingkmeans cluster data into a textfile (for an extract of
 the result see above)
 
 mahout seqdumper \
  -i MahoutCluster > similarProducts.txt
 
 What am I missing?
 
 
 
 
 
 2013/10/3 Ted Dunning ted.dunn...@gmail.com
 
 Yes.  That will work.
 
 The sketch will then contain 10,000 x log N centroids.  If N = 10^9, log N
  \approx 30 so the sketch will have about 300,000 weighted centroids in
 it.  The final clustering will have to process these centroids to produce
 the desired 5,000 clusters.  Since 300,000 is a relatively small number of
 data points, this clustering step should proceed relatively quickly.
 
 
 
 On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz jbon...@googlemail.com
 wrote:
 
 thx for your elaborate answer.
 
 so if the upper bound on the final number of clusters is unknown in the
 beginning, what would happen, if I define a very high number that is
  guaranteed to be > the estimated number of clusters.
 for example if I set it to 10.000 clusters if an estimate of 5.000 is
 likely, will that work?
 
 
 2013/10/2 Ted Dunning ted.dunn...@gmail.com
 
 The way that the new streaming k-means works is that there is a first
 sketch pass which only requires an upper bound on the final number of
 clusters you will want.  It adaptively creates more or less clusters
 depending on the data and your bound.  This sketch is guaranteed to be
 computed within at most one map-reduce pass.  There is a threaded
 version
 that runs (fast) on a single machine.  The threaded version

Re: Editing Dictionary Vector Generated

2013-10-04 Thread Ted Dunning
On Fri, Oct 4, 2013 at 6:13 AM, Puneet Arora arorapuneet2...@gmail.comwrote:

 yes you guessed correct that I am using naive bayes, but how can I handle
 this type of problem.



I didn't hear about a problem.

You said you didn't like weights on words like English to reflect the fact
that they are used in certain contexts.

I said that this is the way it should work.

Unless you demonstrate that you increase accuracy by changing the weights,
I don't know how to go further.  Other algorithms are specifically designed
so that if the weights on English are redundant, then they will be set to
near zero.  Naive bayes purposely ignores such redundancy in order to be
simpler.


Re: What are the best settings for my clustering task

2013-10-02 Thread Ted Dunning
The way that the new streaming k-means works is that there is a first
sketch pass which only requires an upper bound on the final number of
clusters you will want.  It adaptively creates more or less clusters
depending on the data and your bound.  This sketch is guaranteed to be
computed within at most one map-reduce pass.  There is a threaded version
that runs (fast) on a single machine.  The threaded version is liable to be
faster than the map-reduce version for moderate or smaller data sizes.

That sketch can then be used to do all kinds of things that rely on
Euclidean distance and still get results within a small factor of the same
algorithm applied to all of the data.  Typically this second phase is a
ball k-means algorithm, but it could easily be a dp-means algorithm [1] if
you want a variable number of clusters.  Indeed, you could run many
dp-means passes with different values of lambda on the same sketch.  Note
that the sketch is small enough that in-memory clustering is entirely
viable and is very fast.

For the problem you describe, however, you probably don't need the sketch
approach at all and can probably apply ball k-means or dp-means directly.
 Running many k-means clusterings with differing values of k should be
entirely feasible as well with such data sizes.

[1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf




On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz jbon...@googlemail.com wrote:

 Isn't the streaming k-means just a different approach to crunch through the
 data? In other words, the result of streaming k-means should be comparable
 to using k-means in multiple chained map reduce cycles?

 I just read a paper about the k-means clustering and its underlying
 algorithm.

 According to that paper, k-means relies on a preknown/predefined amount of
 clusters as an input parameter.

 Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf

 In my current scenario however, the number of clusters is unknown at the
 beginning.

 Maybe k-means is just not the right algorithm for clustering similar
 products based on their short description text? What else could I use?




 2013/10/1 Ted Dunning ted.dunn...@gmail.com

  At such small sizes, I would guess that the sequential version of the
  streaming k-means or ball k-means would be better options.
 
 
 
  On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 jbon...@googlemail.com
  wrote:
 
   Hello all,
  
    I am currently trying to create clusters from a group of 50.000 strings
 that
   contain product descriptions (around 70-100 characters length each).
  
   That group of 50.000 consists of roughly 5.000 individual products and
  ten
   varying product descriptions per product. The product descriptions are
   already prepared for clustering and contain a normalized brand name,
   product
   model number, etc.
  
    What would be a good approach to maximise the amount of found clusters
  (the
   best possible value would be 5.000 clusters with 10 products each)
  
   I adapted the reuters cluster script to read in my data and managed to
   create a first set of clusters. However, I have not managed to maximise
  the
   cluster count.
  
   The question is: what do I need to tweak with regard to the available
   mahout
   settings, so the clusters are created as precisely as possible?
  
   Many regards!
   Jens
  
  
  
  
  
   --
   View this message in context:
  
 
 http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
   Sent from the Mahout User List mailing list archive at Nabble.com.
  
 



Re: What are the best settings for my clustering task

2013-10-02 Thread Ted Dunning
Yes.  That will work.

The sketch will then contain 10,000 x log N centroids.  If N = 10^9, log N
\approx 30 so the sketch will have about 300,000 weighted centroids in
it.  The final clustering will have to process these centroids to produce
the desired 5,000 clusters.  Since 300,000 is a relatively small number of
data points, this clustering step should proceed relatively quickly.



On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz jbon...@googlemail.com wrote:

 thx for your elaborate answer.

 so if the upper bound on the final number of clusters is unknown in the
 beginning, what would happen, if I define a very high number that is
  guaranteed to be > the estimated number of clusters.
 for example if I set it to 10.000 clusters if an estimate of 5.000 is
 likely, will that work?


 2013/10/2 Ted Dunning ted.dunn...@gmail.com

  The way that the new streaming k-means works is that there is a first
  sketch pass which only requires an upper bound on the final number of
  clusters you will want.  It adaptively creates more or less clusters
  depending on the data and your bound.  This sketch is guaranteed to be
  computed within at most one map-reduce pass.  There is a threaded version
  that runs (fast) on a single machine.  The threaded version is liable to
 be
  faster than the map-reduce version for moderate or smaller data sizes.
 
  That sketch can then be used to do all kinds of things that rely on
  Euclidean distance and still get results within a small factor of the
 same
  algorithm applied to all of the data.  Typically this second phase is a
  ball k-means algorithm, but it could easily be a dp-means algorithm [1]
 if
  you want a variable number of clusters.  Indeed, you could run many
  dp-means passes with different values of lambda on the same sketch.  Note
  that the sketch is small enough that in-memory clustering is entirely
  viable and is very fast.
 
  For the problem you describe, however, you probably don't need the sketch
  approach at all and can probably apply ball k-means or dp-means directly.
   Running many k-means clusterings with differing values of k should be
  entirely feasible as well with such data sizes.
 
  [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
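
For reference, a minimal, self-contained sketch of the dp-means idea from the paper above (plain Java with Euclidean distance, not Mahout code; the lambda threshold and the first-point seeding are simplifications -- the paper seeds with the global mean):

import java.util.ArrayList;
import java.util.List;

// Minimal dp-means sketch (Kulis & Jordan, ICML 2012): the number of clusters
// is not fixed in advance; a new cluster is opened whenever a point is farther
// than sqrt(lambda) from every existing centroid.
public class DpMeansSketch {

  public static List<double[]> cluster(double[][] points, double lambda, int maxIter) {
    List<double[]> centroids = new ArrayList<double[]>();
    centroids.add(points[0].clone());              // seed with the first point
    int[] assignment = new int[points.length];

    for (int iter = 0; iter < maxIter; iter++) {
      // assignment step: nearest centroid, or a brand-new one if everything is too far
      for (int i = 0; i < points.length; i++) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.size(); c++) {
          double d = squaredDistance(points[i], centroids.get(c));
          if (d < bestDist) { bestDist = d; best = c; }
        }
        if (bestDist > lambda) {                   // open a new cluster
          centroids.add(points[i].clone());
          best = centroids.size() - 1;
        }
        assignment[i] = best;
      }
      // update step: recompute each centroid as the mean of its assigned points
      for (int c = 0; c < centroids.size(); c++) {
        double[] sum = new double[points[0].length];
        int count = 0;
        for (int i = 0; i < points.length; i++) {
          if (assignment[i] == c) {
            for (int d = 0; d < sum.length; d++) { sum[d] += points[i][d]; }
            count++;
          }
        }
        if (count > 0) {
          for (int d = 0; d < sum.length; d++) { sum[d] /= count; }
          centroids.set(c, sum);
        }
      }
    }
    return centroids;
  }

  private static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) { sum += (a[i] - b[i]) * (a[i] - b[i]); }
    return sum;
  }
}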
 
 
 
 
  On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz jbon...@googlemail.com
  wrote:
 
   Isn't the streaming k-means just a different approach to crunch through
  the
   data? In other words, the result of streaming k-means should be
  comparable
   to using k-means in multiple chained map reduce cycles?
  
   I just read a paper about the k-means clustering and its underlying
   algorithm.
  
    According to that paper, k-means relies on a preknown/predefined number
  of
   clusters as an input parameter.
  
   Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
  
   In my current scenario however, the number of clusters is unknown at
 the
   beginning.
  
   Maybe k-means is just not the right algorithm for clustering similar
   products based on their short description text? What else could I use?
  
  
  
  
   2013/10/1 Ted Dunning ted.dunn...@gmail.com
  
At such small sizes, I would guess that the sequential version of the
streaming k-means or ball k-means would be better options.
   
   
   
On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 
 jbon...@googlemail.com
wrote:
   
 Hello all,

 I am currently trying to create clusters from a group of 50.000
 strings
   that
 contain product descriptions (around 70-100 characters length
 each).

 That group of 50.000 consists of roughly 5.000 individual products
  and
ten
 varying product descriptions per product. The product descriptions
  are
 already prepared for clustering and contain a normalized brand
 name,
 product
 model number, etc.

 What would be a good approach to maximise the amount of found
  clusters
(the
 best possible value would be 5.000 clusters with 10 products each)

 I adapted the reuters cluster script to read in my data and managed
  to
 create a first set of clusters. However, I have not managed to
  maximise
the
 cluster count.

 The question is: what do I need to tweak with regard to the
 available
 mahout
 settings, so the clusters are created as precisely as possible?

 Many regards!
 Jens






   
  
 



 --
 CEO
 Hightech Marketing Group
 Cell/Mobile: +49 173 539 3588

 

 Hightech Marketing Group
 Frankenstraße 32
 50354 Huerth
 Germany
 Phone: +49 (0)2233 – 619 2741
 Fax: +49 (0)2233 – 619 27419
 Web: www.hightechmg.com



Re: What are the best settings for my clustering task

2013-10-01 Thread Ted Dunning
At such small sizes, I would guess that the sequential version of the
streaming k-means or ball k-means would be better options.



On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 jbon...@googlemail.comwrote:

 Hello all,

 I am currently trying to create clusters from a group of 50.000 strings that
 contain product descriptions (around 70-100 characters length each).

 That group of 50.000 consists of roughly 5.000 individual products and ten
 varying product descriptions per product. The product descriptions are
 already prepared for clustering and contain a normalized brand name,
 product
 model number, etc.

 What would be a good approach to maximise the amount of found clusters (the
 best possible value would be 5.000 clusters with 10 products each)

 I adapted the reuters cluster script to read in my data and managed to
 create a first set of clusters. However, I have not managed to maximise the
 cluster count.

 The question is: what do I need to tweak with regard to the available
 mahout
 settings, so the clusters are created as precisely as possible?

 Many regards!
 Jens








Re: Multidimensional log-likelihood similarity

2013-09-29 Thread Ted Dunning
Yes.  You can turn the normal item-item relationships around to get this.

What you have is an item x feature matrix.  Normally, one has a user x item
matrix in cooccurrence analysis and you get an item x item matrix.

If you consider the features to be users in the computation, then the
resulting indicator matrix would be just what you want.

The basic idea is that items would be related if they share features.  Two
items that have the same feature would be said to co-occur on that feature.
 Finding anomalous cooccurrence would be what you need to do to find items
that co-occur on many features.

This works by building a small 2x2 matrix that relates item A and item B.
 The entries would be feature counts.  The upper left entry of the matrix
is the number of features that A and B both have, the upper right is the
number of features that B has that A does not and so on. Put another way,
the columns represent features that A has or does not have respectively and
the rows represent the features that B has or does not have respectively.
 Items that give high root log-likelihood ratio values should be considered
connected.  Those that have small values for root LLR should be considered
not connected.  The value of the root-LLR should be discarded after
thresholding and should not be considered a measure of the strength of the
relationship.

I would recommend the same down-sampling that the rowSimilarityJob already
does.
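
As a concrete sketch of the 2x2 construction above, assuming Mahout's org.apache.mahout.math.stats.LogLikelihood helper with its rootLogLikelihoodRatio(k11, k12, k21, k22) method (the threshold itself is application-specific):

import java.util.HashSet;
import java.util.Set;
import org.apache.mahout.math.stats.LogLikelihood;

// Build the 2x2 feature-count table for items A and B and score it with
// root-LLR. featuresA/featuresB are the feature sets of the two items;
// totalFeatures is the number of distinct features in the whole collection.
public class ItemFeatureLlr {
  public static double rootLlr(Set<String> featuresA, Set<String> featuresB, long totalFeatures) {
    Set<String> both = new HashSet<String>(featuresA);
    both.retainAll(featuresB);
    long k11 = both.size();                      // features that A and B both have
    long k12 = featuresB.size() - k11;           // features B has that A does not
    long k21 = featuresA.size() - k11;           // features A has that B does not
    long k22 = totalFeatures - k11 - k12 - k21;  // features that neither has
    // Callers compare this value to a threshold to decide connected / not
    // connected, and then discard the value itself.
    return LogLikelihood.rootLogLikelihoodRatio(k11, k12, k21, k22);
  }
}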





On Sun, Sep 29, 2013 at 3:40 AM, Mridul Kapoor mridulkap...@gmail.comwrote:

 Hi

 I have records - items - with many features.
 Something like

 ID, feature1, feature2, ..., featureN
 

 Can I leverage Mahout's log-likelihood similarity metrics for calculating
 the K-Most similar items to a given item X?

 -
 Thanks
 Mridul



Re: Mahout in one PC - multiple cores processor

2013-09-21 Thread Ted Dunning
Just runs in one process.  

Sent from my iPhone

On Sep 20, 2013, at 11:32, Fernando Santos fernandoleandro1...@gmail.com 
wrote:

 Thanks for the help guys.
 
 But these parts of Mahout that don't work with Hadoop also works with some
 other distributed file system or it just runs in one process?
 
 
 
 2013/9/20 Ted Dunning ted.dunn...@gmail.com
 
 It also depends on what you are doing.  Several parts of Mahout have non
 Hadoop versions.
 
 
 On Fri, Sep 20, 2013 at 5:53 AM, parnab kumar parnab.2...@gmail.com
 wrote:
 
 It is always possible to run mahout without a cluster on a single machine
  but do not expect too much performance gain on it if you are using a huge
  data set. Such a setup is primarily meant for development and testing
  purposes on small datasets. If you have a machine with many cores, you
  can
  configure hadoop in pseudo-cluster mode and then point mahout to the hadoop
  directory. Set the number of map and reduce slots in the hadoop conf
 file
 to properly utilize the cores of your processor.
 
 Thanks,
 Parnab
 
 
 On Fri, Sep 20, 2013 at 5:27 PM, Fernando Santos 
 fernandoleandro1...@gmail.com wrote:
 
 Hello everyone,
 
  I'm working with some classification tasks that are taking a long time to
 be
 processed. So looking for a solution I found Mahout.
 
 Does anyone know if using Mahout without any cluster, just in my
 computer,
 it gives better performance than not using it? I mean, is it possible
 to
  treat the different cores of my computer's processor as if they were a
 cluster
 of other machines?
 
 Thanks!
 
 --
 Fernando Santos
 +55 61 8129 8505
 
 
 
 -- 
 Fernando Santos
 +55 61 8129 8505


Re: Mahout in one PC - multiple cores processor

2013-09-20 Thread Ted Dunning
It also depends on what you are doing.  Several parts of Mahout have non
Hadoop versions.


On Fri, Sep 20, 2013 at 5:53 AM, parnab kumar parnab.2...@gmail.com wrote:

 It is always possible to run mahout without a cluster on a single machine
 but do not expect too much performance gain on it if you are using a huge
 data set. Such a setup is primarily meant for development and testing
 purposes on small datasets. If you have a machine with many cores, you can
 configure hadoop in pseudo-cluster mode and then point mahout to the hadoop
 directory. Set the number of map and reduce slots in the hadoop conf file
 to properly utilize the cores of your processor.

 Thanks,
 Parnab


 On Fri, Sep 20, 2013 at 5:27 PM, Fernando Santos 
 fernandoleandro1...@gmail.com wrote:

  Hello everyone,
 
  I'm working with some classification tasks that are taking a long time to
 be
  processed. So looking for a solution I found Mahout.
 
  Does anyone know if using Mahout without any cluster, just in my
 computer,
  it gives better performance than not using it? I mean, is it possible to
  treat the different cores of my computer's processor as if they were a
 cluster
  of other machines?
 
  Thanks!
 
  --
  Fernando Santos
  +55 61 8129 8505
 



Re: Clustering algorithms

2013-09-17 Thread Ted Dunning
Right now the best in terms of speed without losing quality in Mahout is
the streaming k-means implementation.

One exciting possibility is that you probably can combine a streaming
k-means pre-pass with a regularized k-means algorithm in order to get
results more like Lingo.  You could also follow with a DP-means pass to get
an idea of optimal number of clusters.

The idea with streaming k-means is that a first pass does a rough
clustering into a whole lot of clusters.  This pass is fast because only
approximate search is needed.  It is also adaptive so you only have to
specify very roughly how many clusters you will probably be interested in
having later.  The output is an approximate k-means clustering with many
more clusters than you asked for.  This output can then be clustered in
memory using any weighted clustering algorithm you care to use.  For
k-means and certain kinds of data, you can even get nice probabilistic
accuracy bounds for the combo.



On Tue, Sep 17, 2013 at 12:06 PM, Mike Hugo m...@piragua.com wrote:

 Hello,

 I'm new to mahout but have been working with Solr, Carrot2 and clustering
 documents with the Lingo algorithm.  This has worked well for us for
 clustering small sets of search results, but we are now branching out into
 wanting to cluster larger sets of documents (millions of documents to 10s
 of millions of document for now).

 Could someone point me in the right direction as to which of the clustering
 algorithms I should take a look at first (that would be similar to Lingo)?

 Thanks,

 Mike



Re: Tuning parameters for ALS-WR

2013-09-11 Thread Ted Dunning
On Wed, Sep 11, 2013 at 12:07 AM, Sean Owen sro...@gmail.com wrote:

  2. Do we have to tune the similarityclass parameter in item-based CF?
 If
  so, do we compare the mean average precision values based on validation
  data, and then report the same for the test set?
 
 
 Yes you are conceptually looking over the entire hyper-parameter space. If
 the similarity metric is one of those, you are trying different metrics.
 Grid search, just brute-force trying combinations, works for 1-2
 hyper-parameters. Otherwise I'd try randomly choosing parameters, really,
 or else it will take way too long to explore. You try to pick
 hyper-parameters 'nearer' to those that have yielded better scores.


Or use a real exploration algorithm.  For my favorite (hear that horn
blowing?) see this article on recorded-step
meta-mutation: http://arxiv.org/abs/0803.3838
The idea is a randomized search, but with something akin to momentum.  This
lets you search nasty landscapes with pretty good robustness and
smooth ones with fast convergence.  The code and theory are simple and
there is an implementation in Mahout.
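
A loose plain-Java sketch of the recorded-step idea (randomized search where the last successful step biases the next proposal, like momentum); the 0.5 and 0.1 scale factors are arbitrary illustration values and this is not the Mahout implementation:

import java.util.Random;

// Randomized hyper-parameter search with a "recorded step": each accepted move
// is remembered and mixed into the next proposal, so the search drifts in
// directions that have recently paid off.
public class RecordedStepSearch {
  public interface Objective { double value(double[] x); }

  public static double[] optimize(Objective f, double[] start, int evals, long seed) {
    Random rand = new Random(seed);
    double[] current = start.clone();
    double[] lastStep = new double[start.length];      // the recorded step
    double best = f.value(current);
    for (int i = 0; i < evals; i++) {
      double[] candidate = new double[current.length];
      double[] step = new double[current.length];
      for (int d = 0; d < current.length; d++) {
        step[d] = 0.5 * lastStep[d] + 0.1 * rand.nextGaussian();  // momentum + noise
        candidate[d] = current[d] + step[d];
      }
      double value = f.value(candidate);
      if (value > best) {                              // maximizing the score
        best = value;
        current = candidate;
        lastStep = step;                               // record the successful step
      }
    }
    return current;
  }
}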


Re: Tuning parameters for ALS-WR

2013-09-10 Thread Ted Dunning
You definitely need to separate into three sets.

Another way to put it is that with cross validation, any learning algorithm
needs to have test data withheld from it.  The remaining data is training
data to be used by the learning algorithm.

Some training algorithms such as the one that you describe divide their
training data into portions so that they can learn hyper-parameters
separately from parameters.  Whether the learning algorithm does this or
uses some other technique to come to a final value for the model has no
bearing on whether the original test data is withheld and because the test
data has to be unconditionally withheld, any sub-division of the training
data cannot include any of the test data.

In your case, you hold back 25% test data.  Then you divide the remaining
75% into 25% validation and 50% training.  The validation set has to be
separate from the 50% in order to avoid over-fitting, but the test data has
to be separate from the training+validation for the same reason.
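
A minimal sketch of that split for one user's interactions (the record type and the shuffle-based split are placeholders; any deterministic per-user split would work the same way):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hold out 25% of a user's interactions as test data first, then split the
// remaining 75% into validation (25% of the total) and training (50% of the
// total), as described above.
public class ThreeWaySplit {
  public static <T> List<List<T>> split(List<T> userInteractions, long seed) {
    List<T> shuffled = new ArrayList<T>(userInteractions);
    Collections.shuffle(shuffled, new Random(seed));
    int n = shuffled.size();
    int testEnd = n / 4;                    // first 25%: test, withheld from all tuning
    int validationEnd = testEnd + n / 4;    // next 25%: validation
    List<List<T>> parts = new ArrayList<List<T>>();
    parts.add(new ArrayList<T>(shuffled.subList(validationEnd, n)));        // ~50% training
    parts.add(new ArrayList<T>(shuffled.subList(testEnd, validationEnd)));  // validation
    parts.add(new ArrayList<T>(shuffled.subList(0, testEnd)));              // test
    return parts;
  }
}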





On Tue, Sep 10, 2013 at 4:22 PM, Parimi Rohit rohit.par...@gmail.comwrote:

 Hi All,

 I was wondering if there is any experimental design to tune the parameters
 of ALS algorithm in mahout, so that we can compare its recommendations with
 recommendations from another algorithm.

 My datasets have implicit data and would like to use the following design
 for tuning the ALS parameters (alphs, lambda, numfeatures).

 1. Split the data such that for each user, 50% of the clicks go to train,
 25% go to validation, 25% goes to test.

 2. Create the user and item features by applying the ALS algorithm on
 training data, and test on the validation set. (We can pick the parameters
 which minimize the RMSE score; in case of implicit data, Pui - XY’)
 3. Once we find the parameters which give the best RMSE value on
 validation, use the user and item matrices generated for those parameters
 to predict the top k items and test it with the items in the test set
 (compute mean average precision).

 Although the above setting looks good, I have few questions

 1. Do we have to follow this setting, to compare algorithms? Can't we
 report the parameter combination for which we get highest mean average
 precision for the test data, when trained on the train set, with out any
 validation set.
 2. Do we have to tune the similarityclass parameter in item-based CF? If
 so, do we compare the mean average precision values based on validation
 data, and then report the same for the test set?

 My ultimate objective is to compare different algorithms but I am confused
 as to how to compare the best results (based on parameter tuning) between
 algorithms. Are there any publications that explain this in detail? Any
 help/comments about the design of experiments is much appreciated.

 Thanks,
 Rohit



Re: Solr recommender

2013-09-07 Thread Ted Dunning
On Fri, Sep 6, 2013 at 9:33 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 One of the unique things about the Solr recommender is online recs. Two
 scenarios come to mind:
 1) ask the user to pick from among a list of videos, taking the picks as
 preferences and making recs. Make more and see if recs improve.
 2) watch the users' detail views during a browsing session and make recs
 based on those in realtime. A sort of are you looking for something like
 this? recommender.

 For #1 I've seen several examples (BTW very few give instant recs). Not
 sure how they pick what to rate. It seems to me a mix of popular and the
 videos with the most varying ratings would be best. Since we have thumbs up
 and down it would be simple to find individual videos with a high degree of
 both love and hate. Intuitively this would seem to help find the birds of a
 feather among the reviewers and help put the user in with the right set
 with the fewest preferences required.


For #1, Ken's suggestion of clustering seems quite reasonable.  The only
diff is that I would tend to pick something near the centroid of the
cluster *and* that is very popular.  You need to have something people will
recognize.

Clustering can be done by doing SVD or ALS on the user x thing matrix first
or by directly clustering the columns of the user x thing matrix after some
kind of IDF weighting.  I think that only the streaming k-means currently
does well on sparse vectors.


 #2 seems straightforward. No idea if it will be useful. If #2 doesn't seem
 useful is may be modified to become the typical, makes recs based on all
 reviews but also includes recent reviews not yet in the training data.
 That's OK since we'd want to do it anyway.


For #2, I think that this is a great example of multi-modal
recommendations.  You have browsing behavior and your tomatoes-reviews
behavior.  Combining that allows you to recommend for people who have only
one kind of behavior.  Of course, our viewing behavior will be very sparse
to start.


Re: Hadoop implementation of ParallelSGDFactorizer

2013-09-07 Thread Ted Dunning
That means If I Recall Correctly.  It is an internet slang.

See also http://en.wiktionary.org/wiki/Appendix:English_internet_slang


On Sat, Sep 7, 2013 at 12:39 PM, Tevfik Aytekin tevfik.ayte...@gmail.comwrote:

 Sebastian, what is IIRC?

 On Sat, Sep 7, 2013 at 8:24 PM, Sebastian Schelter
 ssc.o...@googlemail.com wrote:
  IIRC the algorithm behind ParallelSGDFactorizer needs shared memory,
  which is not given in a shared-nothing environment.
 
 
  On 07.09.2013 19:08, Tevfik Aytekin wrote:
  Hi,
  There seems to be no Hadoop implementation of ParallelSGDFactorizer.
  ALSWRFactorizer has a Hadoop implementation.
 
  ParallelSGDFactorizer (since it is based on stochastic gradient
  descent) is much faster than ALSWRFactorizer.
 
  I don't know Hadoop much. But it seems to me that a Hadoop
  implementation of ParallelSGDFactorizer will also be much faster than
  the Hadoop implementaion of ALSWRFactorizer.
 
  Is there a specific reason for why there is no Hadoop implementation
  of ParallelSGDFactorizer? Is it because since Hadoop operations are
  already slow the slowness of ALSWRFactorizer does not matter much. Or
  is it simply because nobody has implemented it yet?
 
  Thanks
  Tevfik
 
 



Re: Mahout readable output

2013-09-07 Thread Ted Dunning
Darius's comments are good.

You also have to think about what similar means to you.  From the data you
describe, I see several possibilities:

- geo-location from machine id (if it includes IP address)

- content from the query

- frequency of posting

- diurnal phase of posting (tells us time zone)

Once you know what similar means, you can meaningfully talk about next
steps.

If you assume that only query content matters, then I would go towards
several ways.

- cluster directly based on query histories using IDF weighting (likely to
be kinda sorta lousy results)

- use cooccurrence analysis to augment query histories and repeat the
clustering

- use SVD or ALS to generate user vectors and query term vectors and
cluster users using user vectors and then look for coherence.

If you want to use geo, the question of scaling comes in.

If you want to use time, you have to derive some sort of features.  I find
latent variable methods useful for this.



On Fri, Sep 6, 2013 at 1:25 AM, Darius Miliauskas 
dariui.miliaus...@gmail.com wrote:

 Dear Vishal,

 can you give some code how you performed your mentioned steps:

  #) Created custom VectorIterable by inheriting IterableVector.
  #) Created custom VectorItertor by inheriting AbstractIteratorVector
  #) Model class which will be responsible to pass attribute values
 (username or data etc) to custom VectorIterator
  #) Custom VectorIterator.computeNext() will read line, create dense
 vector having size equal to number of attribute in a row.

 Can you compile the code?


 Best,

 Darius



 2013/9/6 Vishal Danech vishal.dan...@gmail.com

  Hi
 
  I have a custom log data which contains following details.
 
  1) UserName
  2) MachineId
  3) DateTime
  4) Data - which contains text - search term etc
 
  I would like to use this data to know
   #) how much time they are spending on browsing etc.
   #) User based search pattern
 
  First problem can be addressed using Hive query.
 
  For second problem, I suppose clustering can be applied and for this I
 have
  converted data to vectors. I have used dense vector and applied Canopy
  algorithm on it. I got an output which I provided as an input to
  ClusterDump utility but the output I got was not in readable form, I figured
  out that I need to use named vectors so that Key can be displayed as a
  output. Here I am facing issue, how to use NamedVector ?
 
  I am performing following steps to generate vectors..
   #) Created custom VectorIterable by inheriting IterableVector.
   #) Created custom VectorItertor by inheriting
 AbstractIteratorVector
   #) Model class which will be responsible to pass attribute values
  (username or data etc) to custom VectorIterator
   #) Custom VectorIterator.computeNext() will read line, create dense
  vector having size equal to number of attribute in a row.
 
  Please let me know how to add NamedVector here so that I can get some
  readable output from ClusterDump utility.
 
  --
  Thanks and Regards
  Vishal Danech
 



Re: Solr recommender

2013-09-07 Thread Ted Dunning
On Sat, Sep 7, 2013 at 2:35 PM, Pat Ferrel p...@occamsmachete.com wrote:

 ...
 
  Clustering can be done by doing SVD or ALS on the user x thing matrix
 first
  or by directly clustering the columns of the user x thing matrix after
 some
  kind of IDF weighting.  I think that only the streaming k-means currently
  does well on sparse vectors.
 

 Was thinking about filtering out all but the top x% of items to get things
 the user is likely to have heard about if not seen. Do this before any
 factorizing or clustering.


Hmm...

My reflex would be to trim *after* clustering so that clustering has the
benefit of the long-tail.


 ...
  For #2, I think that this is a great example of multi-modal
  recommendations.  You have browsing behavior and your tomatoes-reviews
  behavior.  Combining that allows you to recommend for people who have
 only
  one kind of behavior.  Of course, our viewing behavior will be very
 sparse
  to start.

 Yes, that's why I'm not convinced it will be useful but an interesting
 experiment now that we have the online Solr recommender. Soon we'll have
 category and description metadata from the crawler. We can experiment with
 things like category boosting if a category trend emerges during the
 browsing session and I suspect it often does--maybe release date etc. The
 ease of mixing metadata with behavior is another thing worth experimenting
 with.


Cool.

And remember meta-data becomes behavior when you interact with an item
since you have just interacted with the meta-data as well.

Btw... I am spinning up a team internally and a team at a partner site to
help with the Mahout demo.  I am trying to generate realistic music
consumption data this weekend as well.


Re: lucene.vectors not working

2013-09-06 Thread Ted Dunning
Ahh...

That makes a lot of sense.



On Thu, Sep 5, 2013 at 11:38 PM, Lauren Massa-Lochridge
laurl...@ieee.orgwrote:

 Ted Dunning ted.dunning at gmail.com writes:

 
  OK.
 
  So the easy answer strikes out.
 
  On Sat, Aug 3, 2013 at 5:04 AM, Swami Kevala 
  swami.kevala at ishafoundation.org wrote:
 
   Ted Dunning ted.dunning at gmail.com writes:
  
   
Does your index actually have term vectors?
   
On Fri, Aug 2, 2013 at 9:00 PM, Swami Kevala 
swami.kevala at ishafoundation.org wrote:
   
  
   Well yes... I used the example data that was supplied with the Solr
 4.3.1
   installation. I checked the schema before posting the example docs to
 the
   index, and it already had the option termVectors=true set for the
   includes
   field by default
  
  
 

 I've had the same error message only once, using a schema I've had in use
 over multiple version upgrades. I.e. a schema known to be correctly
 configured for term vectors.
 I hadn't noticed that only a minuscule count of documents had been indexed.
 If I recall correctly, it was well under 100.

 I never use the example data, but I would check to see that it's all really
 indexed or try a larger data set in case something changed relative to the
 example data.

 Lauren Massa-Lochridge
 AC7IONABL3






Re: using KmeansDriver with HDFS

2013-09-05 Thread Ted Dunning
On Wed, Sep 4, 2013 at 6:58 PM, Alan Krumholz alan_krumh...@yahoo.com.mxwrote:

 I pulled that code
 (org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:215)and
 I think is trying to read a file from one of the paths I passed to the
 method but with a new instance of the configuration object (not the
 configuration object I passed to the method but one that doesn't have my
 HDFS configured)



This is quite plausibly a bug.  This is a common error when using the HDFS
API.

Have you checked what happens with 0.8?
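
A small diagnostic sketch for that situation: if a freshly constructed Configuration (which is what the library code does internally) cannot see the cluster's config files on the classpath, the default file system silently falls back to the local one:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Quick check of what a bare Configuration() resolves to. If this prints
// file:/// rather than hdfs://..., the cluster's core-site.xml / hdfs-site.xml
// are not on the classpath, and any code that builds its own Configuration
// (as reported above) will read from the local file system.
public class HdfsConfigCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    System.out.println("fs.default.name = " + conf.get("fs.default.name"));  // Hadoop 1.x key
    System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));     // Hadoop 2.x key
    System.out.println("resolved fs     = " + FileSystem.get(conf).getUri());
  }
}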


Re: Has anyone implemented true L-LDA out of Mahout?

2013-09-05 Thread Ted Dunning
I haven't seen any discussion of this other than what you reference.


On Thu, Sep 5, 2013 at 7:59 AM, Henry Lee honesthe...@gmail.com wrote:

 I am about to implement Jake Mannix's suggestion out of Twitter fork.

 Has anyone already implemented true L-LDA out of Mahout?

 http://markmail.org/message/cm2a6rnxblj5azuh

 over this fork?


 https://github.com/twitter/mahout/blob/master/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0PriorMapper.java

 Thanks,
 Henry Lee



Re: Tweaking ALS models to filter out highly related items when an item has been purchased

2013-09-05 Thread Ted Dunning
I think that Dominik's comments are exactly on target.

As far as implementation is concerned, I think that it is very important to
not distort the basic recommendation algorithm with business rules like
this.  It is much better to post-process the results to impose your will
directly.  One exception to this is that I think it is reasonable to use
ordered cooccurrence and also repeated cooccurrence here for some hints
here.  This lets you determine likely accessories (purchased after the main
item, mostly) and also find razor-blades (highly repetitive purchases).
 You still have the problem of flooding with similar items.

The diversity that you are talking about is a critical quality in
recommendation results.  The basic intuition is that recommendation results
are not individual recommendations, but are included in a portfolio of
recommendations.  You need the diversity in this portfolio because if you
are wrong about an item, the likelihood of being wrong about very similar
items is high.  If you flood the first and second pages with these similar
items, then you don't have room for the alternative items that might well
be correct.

My approach in the past was to define heuristic definitions for too
similar and do a pass over the sorted recommendation results giving each
item that passes the too-similar criterion a penalty score.  When done with
this, I re-sort the results and the duplicative content falls to the bottom
of the recommendations.
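
A small sketch of that penalize-and-re-sort pass (the similarity source, the too-similar threshold, and the penalty size are all placeholders to be tuned per application):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Walk the ranked recommendations top-down, penalize any item that is "too
// similar" to one already kept, then re-sort so duplicative items sink toward
// the bottom of the list.
public class DiversityRescorer {

  public interface Similarity { double between(long itemA, long itemB); }

  public static class Rec {
    public final long itemId;
    public double score;
    public Rec(long itemId, double score) { this.itemId = itemId; this.score = score; }
  }

  public static List<Rec> rescore(List<Rec> ranked, Similarity sim,
                                  double tooSimilar, double penalty) {
    List<Rec> seen = new ArrayList<Rec>();
    for (Rec rec : ranked) {
      for (Rec earlier : seen) {
        if (sim.between(rec.itemId, earlier.itemId) > tooSimilar) {
          rec.score -= penalty;        // push near-duplicates down the list
          break;
        }
      }
      seen.add(rec);
    }
    List<Rec> result = new ArrayList<Rec>(ranked);
    Collections.sort(result, new Comparator<Rec>() {
      public int compare(Rec a, Rec b) { return Double.compare(b.score, a.score); }
    });
    return result;
  }
}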



On Thu, Sep 5, 2013 at 1:15 AM, Dominik Hübner cont...@dhuebner.com wrote:

 Just a quick a assumption, maybe I have not thought this through enough:

 1. Users probably tend to compare products = similar VIEWS
 2. User as well might tend to PURCHASE accessory products, like the laptop
 bag you mentioned

 May be you could filter out products that have a similarity computed from
 the product views, but leave those similar, based on purchases, in your
 recommendation set?

 Nevertheless, I guess this will be strongly depending on the domain the
 data comes from.


 On Sep 5, 2013, at 10:07 AM, Nick Pentreath nick.pentre...@gmail.com
 wrote:

  Hi all
 
  Say I have a set of ecommerce data (views, purchases etc). I've built my
  model using implicit feedback ALS. Now, I want to add a little bit of
  smart filtering.
 
  Filtering based on not recommending something that has been purchased is
  straightforward, but I'd like to also filter so as not to recommend
 highly
  similar items to someone who has purchased an item.
 
  In other words, if someone has just purchased a laptop, then I'd like to
  not recommend other laptops. Ideally while still recommending related
  items such as laptop bags, mouse etc etc. (this is just an example).
 
  Now, I could filter based on metadata tags like category, but assuming
 I
  don't always have that data, then simplistically I have the option of
  filtering out products based on those that have high cosine similarity to
  the purchased products. However, this risks filtering out good similar
  products (like the laptop bags) as well as the bad similar products.
 
  I'm experimenting with building a second variant of the model that
  effectively downweights views to near zero, hence leaving something
 sort
  of like a purchased together model variant. Then recommendations can be
  made using this model when a user purchases an item (or perhaps a
 re-scorer
  that is a weighted variant of model A and model B but that tends to
 weight
  model B - the purchased together model - higher)
 
  Are there other mechanisms to tweak the ALS model such that it tends
  towards recommending related products (but not highly similar of the
  exact same narrow product type)?
 
  Any other ideas about how best to go about this?
 
  Many thanks
  Nick




Re: ALS and SVD feature vectors

2013-09-04 Thread Ted Dunning
On Wed, Sep 4, 2013 at 10:59 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:

  Now, what happens in the case of SVD?
  The vectors are normal by definition.
  Are singular values used at all, or just left and right singular vectors?

 SVD does not take weights so it cannot ignore or weigh out a
  non-observation, which is why it is not well suited for the matrix
 completion problem per se


There are multiple ways to read the use of weights here.

In the original posting, I think the gist was how to treat the singular
values, not how to weight different observations.  Mahout's SSVD allows the
singular values to be kept separate, to be applied entirely to the left or
right singular values or to be split across both in a square root sort of
way.
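
In matrix terms, for A ~= U * diag(sigma) * V', the options above amount to scaling the columns of a factor by sigma^p with p = 1, 0.5, or 0. A tiny plain-Java sketch of that column scaling (an illustration of the math, not Mahout's SSVD code):

// Scale column j of a factor matrix by sigma[j]^p. With p = 1 the singular
// values go entirely to this side, with p = 0.5 they are split square-root
// style across both sides, and with p = 0 they are kept separate.
public class SingularValueWeighting {
  public static double[][] scaleColumns(double[][] factor, double[] sigma, double p) {
    double[][] scaled = new double[factor.length][];
    for (int i = 0; i < factor.length; i++) {
      scaled[i] = new double[factor[i].length];
      for (int j = 0; j < factor[i].length; j++) {
        scaled[i][j] = factor[i][j] * Math.pow(sigma[j], p);
      }
    }
    return scaled;
  }
}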


Re: Cannot build source version mahout-distribution-0.8

2013-08-27 Thread Ted Dunning
You also have to watch out in the case of web errors. Maven can store an
error message instead of a well formed file in your repo leading to all
kinds of confusion.  Try deleting thus

*rm -rf ~/.m2/repository/com/ibm*


On Tue, Aug 27, 2013 at 7:37 AM, Stevo Slavić ssla...@gmail.com wrote:

 Hello Michael,

 Seems like temporary Maven Central repo mirror(s) issue. I've just tried
 several times to open with browser
 http://repo1.maven.org/maven2/org/apache/maven/plugins/ and sometimes it
 responds well, and few times it returns empty page.

 So, please try again.

 Kind regards,
 Stevo Slavic.


 On Tue, Aug 27, 2013 at 3:59 PM, Michael Wechner
 michael.wech...@wyona.comwrote:

  Hi
 
  I have downloaded
 
   http://mirror.switch.ch/mirror/apache/dist/mahout/0.8/mahout-distribution-0.8-src.zip
 
 
  and tried to build it with
 
  mvn -DskipTests clean install
 
  on Mac OS X 10.6.8 with Java 1.6.0_45 and Maven 3.0.4
 
  but reveived the following errors:
 
   [INFO] ------------------------------------------------------------------------
   [INFO] Reactor Summary:
   [INFO]
   [INFO] Mahout Build Tools ................................ SUCCESS [13.168s]
   [INFO] Apache Mahout ..................................... SUCCESS [2.823s]
   [INFO] Mahout Math ....................................... SUCCESS [1:02.822s]
   [INFO] Mahout Core ....................................... SUCCESS [1:26.430s]
   [INFO] Mahout Integration ................................ FAILURE [1:45.435s]
   [INFO] Mahout Examples ................................... SKIPPED
   [INFO] Mahout Release Package ............................ SKIPPED
   [INFO] ------------------------------------------------------------------------
   [INFO] BUILD FAILURE
   [INFO] ------------------------------------------------------------------------
   [INFO] Total time: 4:31.448s
   [INFO] Finished at: Tue Aug 27 15:19:31 CEST 2013
   [INFO] Final Memory: 27M/123M
   [INFO] ------------------------------------------------------------------------
   [ERROR] Failed to execute goal on project mahout-integration: Could not
   resolve dependencies for project org.apache.mahout:mahout-integration:jar:0.8:
   Could not transfer artifact com.ibm.icu:icu4j:jar:49.1 from/to central
   (http://repo.maven.apache.org/maven2):
   GET request of: com/ibm/icu/icu4j/49.1/icu4j-49.1.jar from central
   failed: Premature end of Content-Length delimited message body (expected:
   7407144; received: 4098921) - [Help 1]
   [ERROR]
   [ERROR] To see the full stack trace of the errors, re-run Maven with the
   -e switch.
   [ERROR] Re-run Maven using the -X switch to enable full debug logging.
   [ERROR]
   [ERROR] For more information about the errors and possible solutions,
   please read the following articles:
   [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
 
  [ERROR]
  [ERROR] After correcting the problems, you can resume the build with the
  command
  [ERROR]   mvn goals -rf :mahout-integration
 
  Does anybody else experience the same problem?
 
  Thanks
 
  Michael
 



Re: Draft Interactive Viz for Exploring Co-occurrence, Recommender calculations

2013-08-19 Thread Ted Dunning
Yes.

Correlation is a problem because tables like

1 0
0 10^6

and

10 0
0 10^6

produce the same correlation.  LLR correctly distinguishes these cases.



On Mon, Aug 19, 2013 at 7:16 AM, Pat Ferrel p...@occamsmachete.com wrote:

 Which is why LLR would be really nice in two action cross-similairty case.
 The cross-corelation sparsification via cooccurrence is probably pretty
 weak, no?


 On Aug 18, 2013, at 11:53 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 Outside of the context of your demo, suppose that you have events a, b, c
 and d.  Event a is the one we are centered on and is relatively rare.
 Event b is not so rare, but has weak correlation with a.  Event c is as
 rare as a, but correlates strongly with it.  Even d is quite common, but
 has no correlation with a.

 The 2x2 matrices that you would get would look something like this.  In
 each of these, a and NOT a are in rows while other and NOT other are in
 columns.

 versus b, llrRoot = 8.03
            b    NOT b
  a        10       10
  NOT a  1000    99000

 versus c, llrRoot = 11.5
            c    NOT c
  a        10       10
  NOT a    30    99970

 versus d, llrRoot = 0
            d    NOT d
  a        10       10
  NOT a     5        5

 Note that what we are holding constant here is the prevalence of a (20
 times) and the distribution of a under the conditions of the other symbol.
 What is being varied is the distribution of the other symbol in the NOT
 a case.




 On Sun, Aug 18, 2013 at 10:50 AM, B Lyon bradfl...@gmail.com wrote:

  Thanks folks for taking a look.
 
  I haven't sat down to try it yet, but wondering how hard it is to
 construct
  (realizable and realistic) k11, k12, k21, k22 values for three binary
  sequences X, Y, Z where (X,Y) and (Y,Z) have same co-occurrence, but you
  can tweak k12 and k21 so that the LLR values are extremely different in
  both directions.  I assume that k22 doesn't matter much in practice since
  things are sparse and k22 is huge.  Well, obviously, I guess you could
  simply switch the k12/k21 values between the two sequence pairs to flip
 the
  order at will... which is information that co-occurrence of course does
 not
  know about.
 
 
  On Sat, Aug 17, 2013 at 10:30 PM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
  This is nice.  As you say, k11 is the only part that is used in
  cooccurrence and it doesn't weight by prevalence, either.
 
  This size analysis is hard to demonstrate much difference because it is
  hard to show interesting values of LLR without absurdly strong
  coordination
  between items.
 
 
  On Fri, Aug 16, 2013 at 8:21 PM, B Lyon bradfl...@gmail.com wrote:
 
  As part of trying to get a better grip on recommenders, I have started
  a
  simple interactive visualization that begins with the raw data of
  user-item
  interactions and goes all the way to being able to twiddle the
  interactions
  in a test user vector to see the impact on recommended items.  This is
  for
  simple user interacted with an item case rather than numerical
  preferences for items.  The goal is to show the intermediate pieces and
  how
  they fit together via popup text on mouseovers and dynamic highlighting
  of
  the related pieces.  I am of course interested in feedback as I keep
  tweaking on it - not sure I got all the terminology quite right yet,
  for
  example, and might have missed some other things I need to know about.
  Note that this material is covered in Chapter 6.2 in MIA in the
  discussion
  on distributed recommenders.
 
  It's on googledrive here (very much a work-in-progress):
 
  https://googledrive.com/host/0B2GQktu-wcTiWHRwZFJacjlqODA/
 
  (apologies to small resolution screens)
 
  This is based only on the co-occurrence matrix, rather than including
  the
  other similarity measures, although in working through this, it seems
  that
  the other ones can just be interpreted as having alternative
  definitions
  of
  what * means in matrix multiplication of A^T*A, where A is the
  user-item
  matrix... and as an aside to me begs the interesting question of
  [purely
  hypotheticall?] situations where LLR and co-occurrence are at odds with
  each other in making recommendations, as co-occurrence seems to be just
  using the k11 term that is part of the LLR calculation.
 
  My goal (at the moment at least) is to eventually continue this for the
  solr-recommender project that started as few weeks ago, where we have
  the
  additional cross-matrix, as well as a kind of regrouping of pieces for
  solr.
 
 
  --
  BF Lyon
  http://www.nowherenearithaca.com
 
 
 
 
 
  --
  BF Lyon
  http://www.nowherenearithaca.com
 




Re: Setting up a recommender

2013-08-19 Thread Ted Dunning
Pat,

That really sounds great.

I should find some time (who needs sleep) to generate music logs for you as
well.


On Mon, Aug 19, 2013 at 8:31 AM, Pat Ferrel p...@occamsmachete.com wrote:

 There are three things I could work on my free time:

 1) test this on a bigger data set gathered from rotten tomatoes, which
 only has B data (movie thumbs up)
 2) begin work on the Solr query and service integration, rather than the
 current loose LucidWorks Search integration.
 3) make sure everything is set up for different item spaces in B and A.

 Planning to tackle in this order, unless someone speaks up.


 On Aug 16, 2013, at 1:39 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Works on a cluster but have only tested on the trivial test data set.

 On Aug 13, 2013, at 4:49 PM, Pat Ferrel p...@occamsmachete.com wrote:

 OK single action recs are working so output to Solr with only [B'B] and B.

 On Aug 13, 2013, at 10:52 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 Corrections inline

  On Aug 13, 2013, at 10:49 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
  I finally got some time to work on this and have a first cut at output
 to Solr working on the github repo. It only works on 2-action input but
 I'll have that cleaned up soon so it will work with one action. Solr
 indexing has not been tested yet and the field names and/or types may need
 tweaking.
 
  It takes the result of the previous drop:
  1) DRMs for B (user history or B items action1) and A (user history of A
 items action2)
  2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
 
  There are two final outputs created using mapreduce but requiring 2
 in-memory hashmaps. I think this will work on a cluster (the hashmaps are
 instantiated on each node) but haven't tried yet. It orders items in #2
 fields by strength of link, which is the similarity value used in [B'B]
 or [B'A]. It would be nice to order #1 by recency but there is no provision
 for passing through timestamps at present so they are ordered by the
 strength of preference. This is probably not useful and so can be ignored.
 Ordering by recency might be useful for truncating queries by recency while
 leaving the training data containing 100% of available history.
 
  1) It joins #1 DRMs to produce a single set of docs in CSV form, which
 looks like this:
  id,history_b,history_a
 u1,iphone ipad,iphone ipad galaxy
  ...
 
  2) it joins #2 DRMs to produce a single set of docs in CSV form, which
 looks like this:
  id,b_b_links,b_a_links
 iphone,iphone ipad,iphone ipad galaxy
  …
 
  It may work on a cluster, I haven't tried yet. As soon as someone has
 some large-ish sample log files I'll give them a try. Check the sample
 input files in the resources dir for format.
 
  https://github.com/pferrel/solr-recommender
 
 
  On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
  When I started looking at this I was a bit skeptical. As a Search engine
 Solr may be peerless, but as yet another NoSQL db?
 
  However getting further into this I see one very large benefit. It has
 one feature that sets it completely apart from the typical NoSQL db. The
 type of queries you do return fuzzy results--in the very best sense of that
 word. The most interesting queries are based on similarity to some
 exemplar. Results are returned in order of similarity strength, not ordered
 by a sort field.
 
  Wherever similarity based queries are important I'll look at Solr first.
 SolrJ looks like an interesting way to get Solr queries on POJOs. It's
 probably at least an alternative to using docs and CSVs to import the data
 from Mahout.
 
 
 
  On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  Yes.  That would be interesting.
 
 
 
 
  On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:
 
  A little digression: Might a Matrix implementation backed by a Solr
 index
  and uses SolrJ for querying help at all for the Solr recommendation
  approach?
 
  It supports multiple fields of String, Text, or boolean flags.
 
  Best
  Gokhan
 
 
  On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com
 wrote:
 
  Also a question about user history.
 
  I was planning to write these into separate directories so Solr could
  fetch them from different sources but it occurs to me that it would be
  better to join A and B by user ID and output a doc per user ID with
 three
  fields, id, A item history, and B item history. Other fields could be
  added
  for users metadata.
 
  Sound correct? This is what I'll do unless someone stops me.
 
  On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
  Once you have a sample or example of what you think the
  log file version will look like, can you post it? It would be great
 to
  have example lines for two actions with or without the same item IDs.
  I'll
  make sure we can digest it.
 
  I thought more about the ingest part and I don't think the
 one-item-space
  is actually a problem. It just

Re: Draft Interactive Viz for Exploring Co-occurrence, Recommender calculations

2013-08-18 Thread Ted Dunning
Outside of the context of your demo, suppose that you have events a, b, c
and d.  Event a is the one we are centered on and is relatively rare.
 Event b is not so rare, but has weak correlation with a.  Event c is as
rare as a, but correlates strongly with it.  Even d is quite common, but
has no correlation with a.

The 2x2 matrices that you would get would look something like this.  In
each of these, a and NOT a are in rows while other and NOT other are in
columns.

versus b, llrRoot = 8.03
           b    NOT b
 a        10       10
 NOT a  1000    99000

versus c, llrRoot = 11.5
           c    NOT c
 a        10       10
 NOT a    30    99970

versus d, llrRoot = 0
           d    NOT d
 a        10       10
 NOT a     5        5

Note that what we are holding constant here is the prevalence of a (20
times) and the distribution of a under the conditions of the other symbol.
 What is being varied is the distribution of the other symbol in the NOT
a case.




On Sun, Aug 18, 2013 at 10:50 AM, B Lyon bradfl...@gmail.com wrote:

 Thanks folks for taking a look.

 I haven't sat down to try it yet, but wondering how hard it is to construct
 (realizable and realistic) k11, k12, k21, k22 values for three binary
 sequences X, Y, Z where (X,Y) and (Y,Z) have same co-occurrence, but you
 can tweak k12 and k21 so that the LLR values are extremely different in
 both directions.  I assume that k22 doesn't matter much in practice since
 things are sparse and k22 is huge.  Well, obviously, I guess you could
 simply switch the k12/k21 values between the two sequence pairs to flip the
 order at will... which is information that co-occurrence of course does not
 know about.


 On Sat, Aug 17, 2013 at 10:30 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  This is nice.  As you say, k11 is the only part that is used in
  cooccurrence and it doesn't weight by prevalence, either.
 
  This size analysis is hard to demonstrate much difference because it is
   hard to show interesting values of LLR without absurdly strong
 coordination
  between items.
 
 
  On Fri, Aug 16, 2013 at 8:21 PM, B Lyon bradfl...@gmail.com wrote:
 
   As part of trying to get a better grip on recommenders, I have started
 a
   simple interactive visualization that begins with the raw data of
  user-item
   interactions and goes all the way to being able to twiddle the
  interactions
   in a test user vector to see the impact on recommended items.  This is
  for
   simple user interacted with an item case rather than numerical
   preferences for items.  The goal is to show the intermediate pieces and
  how
   they fit together via popup text on mouseovers and dynamic highlighting
  of
   the related pieces.  I am of course interested in feedback as I keep
   tweaking on it - not sure I got all the terminology quite right yet,
 for
   example, and might have missed some other things I need to know about.
Note that this material is covered in Chapter 6.2 in MIA in the
  discussion
   on distributed recommenders.
  
   It's on googledrive here (very much a work-in-progress):
  
   https://googledrive.com/host/0B2GQktu-wcTiWHRwZFJacjlqODA/
  
   (apologies to small resolution screens)
  
   This is based only on the co-occurrence matrix, rather than including
 the
   other similarity measures, although in working through this, it seems
  that
   the other ones can just be interpreted as having alternative
 definitions
  of
   what * means in matrix multiplication of A^T*A, where A is the
  user-item
   matrix... and as an aside to me begs the interesting question of
 [purely
   hypotheticall?] situations where LLR and co-occurrence are at odds with
   each other in making recommendations, as co-occurrence seems to be just
   using the k11 term that is part of the LLR calculation.
  
   My goal (at the moment at least) is to eventually continue this for the
   solr-recommender project that started as few weeks ago, where we have
 the
   additional cross-matrix, as well as a kind of regrouping of pieces for
   solr.
  
  
   --
   BF Lyon
   http://www.nowherenearithaca.com
  
 



 --
 BF Lyon
 http://www.nowherenearithaca.com



Re: Draft Interactive Viz for Exploring Co-occurrence, Recommender calculations

2013-08-17 Thread Ted Dunning
This is nice.  As you say, k11 is the only part that is used in
cooccurrence and it doesn't weight by prevalence, either.

This size analysis is hard to demonstrate much difference because it is
hard to show interesting values of LLR without absurdly strong coordination
between items.


On Fri, Aug 16, 2013 at 8:21 PM, B Lyon bradfl...@gmail.com wrote:

 As part of trying to get a better grip on recommenders, I have started a
 simple interactive visualization that begins with the raw data of user-item
 interactions and goes all the way to being able to twiddle the interactions
 in a test user vector to see the impact on recommended items.  This is for
 simple user interacted with an item case rather than numerical
 preferences for items.  The goal is to show the intermediate pieces and how
 they fit together via popup text on mouseovers and dynamic highlighting of
 the related pieces.  I am of course interested in feedback as I keep
 tweaking on it - not sure I got all the terminology quite right yet, for
 example, and might have missed some other things I need to know about.
  Note that this material is covered in Chapter 6.2 in MIA in the discussion
 on distributed recommenders.

 It's on googledrive here (very much a work-in-progress):

 https://googledrive.com/host/0B2GQktu-wcTiWHRwZFJacjlqODA/

 (apologies to small resolution screens)

 This is based only on the co-occurrence matrix, rather than including the
 other similarity measures, although in working through this, it seems that
 the other ones can just be interpreted as having alternative definitions of
 what * means in matrix multiplication of A^T*A, where A is the user-item
 matrix... and as an aside to me begs the interesting question of [purely
 hypotheticall?] situations where LLR and co-occurrence are at odds with
 each other in making recommendations, as co-occurrence seems to be just
 using the k11 term that is part of the LLR calculation.

 My goal (at the moment at least) is to eventually continue this for the
 solr-recommender project that started as few weeks ago, where we have the
 additional cross-matrix, as well as a kind of regrouping of pieces for
 solr.


 --
 BF Lyon
 http://www.nowherenearithaca.com



Re: Install mahout 0.8 with hadoop 2.0

2013-08-14 Thread Ted Dunning
Honest feedback is always welcome on this mailing list.  Don't ever worry about 
flames for that.  

Don't forget that mr v1 is an option with hadoop 2. Confusing as that may be.  

Iterative algos are, as you say, very important.  My current inclination is to 
lean toward a downpour style of implementation. That fits well with yarn but it 
also actually fits reasonably with mr v1.  

Sent from my iPhone

On Aug 13, 2013, at 20:13, Carlos Mundi cmu...@gmail.com wrote:

 Anyway, I apologize if anyone takes offense.  None is meant, so please
 flame me off-list if you must.  But since I self-identify as a member of
 the small demand set Ted Dunning describes, I figure I can chime in.  As
 always, YMMV.


Re: Install mahout 0.8 with hadoop 2.0

2013-08-13 Thread Ted Dunning
No.  There is very small demand for Mahout on Hadoop 2.0 so far and the
forward/backward incompatibility of 2.0 has made it difficult to motivate
moving to 2.0.

The bigtop guys built a maven profile for 0.23 some time ago.  I don't know
the status of that.

I don't think that the differences are huge ... it is just the standard
Hadoop forklift-the-world upgrade experience.



On Tue, Aug 13, 2013 at 6:49 AM, Sergey Svinarchuk 
ssvinarc...@hortonworks.com wrote:

 Hi all,

 Somebody compile and install mahout with hadoop 2.0? If yes, that what
 changes you make in mahout, that it have 100% passed unit tests and
 successful work with hadoop 2.0?

 Thanks



Re: RowSimilarityJob, sampleDown method problem

2013-08-13 Thread Ted Dunning
Why do you think this?


On Tue, Aug 13, 2013 at 11:56 AM, sam wu swu5...@gmail.com wrote:

 Mahout 0.9 snapshot

 RowSimilarityJob.java , sampleDown method
 line 291 or 300

  double rowSampleRate = Math.min(maxObservationsPerRow, observationsPerRow)
 / observationsPerRow;

  returns either 0.0 or 1.0, not a fraction; it needs a (double) cast.


 BR

 Sam



Re: RowSimilarityJob, sampleDown method problem

2013-08-13 Thread Ted Dunning
Ouch.

Sorry... your original posting made it sound like you *wanted* it to be 0.0
or 1.0.

This is a bug.  Can you file a JIRA?


On Tue, Aug 13, 2013 at 12:04 PM, sam wu swu5...@gmail.com wrote:

 say column a has 1000 entries, maxPref=700
 rowSampleRate = Math.min(maxObservationsPerRow, observationsPerRow) /
 observationsPerRow;
 we get rowSampleRate =0.0 ( not 0.7)
  do we totally skip this column or sample column entries with .7 probability
 (roughly get 700 entries)
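
A minimal illustration of the integer-division problem being described, assuming the two counters are integer types (as the "(double) casting" remark implies); 700 and 1000 are the example numbers from this thread:

// Integer division truncates before the result is widened to double.
public class SampleRateBug {
  public static void main(String[] args) {
    int observationsPerRow = 1000;
    int maxObservationsPerRow = 700;

    double broken = Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow;
    double fixed = (double) Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow;

    System.out.println(broken);  // 0.0 -- 700 / 1000 is evaluated in int arithmetic first
    System.out.println(fixed);   // 0.7 -- the cast forces floating-point division
  }
}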




 On Tue, Aug 13, 2013 at 11:58 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  Why do you think this?
 
 
  On Tue, Aug 13, 2013 at 11:56 AM, sam wu swu5...@gmail.com wrote:
 
   Mahout 0.9 snapshot
  
   RowSimilarityJob.java , sampleDown method
   line 291 or 300
  
double rowSampleRate = Math.min(maxObservationsPerRow,
  observationsPerRow)
   / observationsPerRow;
  
   return either 0.0 or 1.0, not fraction. needs (double) casting  
  
  
  
   BR
  
   Sam
  
 



Re: Help regarding Seq2sparse utility

2013-08-12 Thread Ted Dunning
Ah.

I get it.  Ish.

I think, but am not entirely sure that there are two outputs possible that
you might be seeing.

One is the centroids of the vectors themselves.  These tend to densify, but
I am not sure if these actually are dense vectors (I would tend to think
so).  That might be what you are seeing.

The second is the assignment of your original vectors to the nearest
cluster.  Here, the vector is just your original vector.  This output could
be in the form of a cluster id followed by the id's on all the vectors in
that cluster.  That doesn't look like what you are seeing.

Can you say what actual commands you are running?  Without that, it is
a bit hard to say what you are seeing.






On Sun, Aug 11, 2013 at 10:57 PM, Ashwini P ashwini.a...@gmail.com wrote:

 Hi Ted,

 My apologies for not framing the question on clusterdumper properly. I am
 getting the output from clusterdumper in the expected format.  A sample
 vector from the  clusterdumper output is as shown below:

 1.0: /all-exchanges-strings.lc.txt = [amex:0.161, ase:0.161, asx:0.161,
 biffex:0.161, bse:0.161, cboe:0.161, cbt:0.161, cme:0.161, comex:0.161,
 cse:0.161, fox:0.136, fse:0.161, hkse:0.161, ipe:0.161, jse:0.161,
 klce:0.161, klse:0.161, liffe:0.161, lme:0.161, lse:0.161, mase:0.161,
 mise:0.161, mnse:0.161, mose:0.161, nasdaq:0.161, nyce:0.161, nycsce:0.161,
 nymex:0.161, nyse:0.161, ose:0.161, pse:0.161, set:0.136, simex:0.161,
 sse:0.161, stse:0.161, tose:0.161, tse:0.161, wce:0.161, zse:0.161]

  What I originally wanted to know is: are these vectors just the way
  clusterdumper prints them (i.e. are they dense vectors) or are they sparse
  vectors and the clusterdumper iterates over the non-zero values and prints
 only those values. If they are sparse vectors, Can you kindly tell me in
 which directory are the vectors generated by the algorithm so I can read
 them.

 If the vectors are in dense format then I need to convert them to sparse
  vectors. As can be seen from the clusterdump output sample above, only the
  features which have non-zero values for each vector are being printed. The
 set of features which have non-zero values will differ from vector to
 vector. Consider we have 3 vectors f1,f2,f3 each with a set of nonzero
 features s1,s2 and s3 respectively. What I want is a set
  S={s1 U s2 U s3}
 i.e. S is the union of the sets of non-zero features for each vector so
 that I can convert the dense vectors to sparse vectors.

 Your thoughts on this are welcome.

 Thanks,
 Ashvini



 On Mon, Aug 12, 2013 at 10:55 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  Aside from your issues with clusterdumper, the values you want can be had
  from a sparse vector using v.iterateNonZero() and v.norm(0).
 
  The issue with clusterdumper is odd.
 
  Are you saying that the display shows all the components of the vector?
  Or
  that there is an in-memory representation that has been densified?
 
 
 
  On Sun, Aug 11, 2013 at 9:24 PM, Ashwini P ashwini.a...@gmail.com
 wrote:
 
   Hello,
  
   I am new to mahout. I want to know how I can get the list of features
  that
    were extracted from the corpus by seq2sparse and the count of the
 total
   number of features.
  
   My problem is that when I view the clustering output using
 clusterdumper
  I
   get only dense vectors  for each point that belongs in the cluster but
 I
   want the sparse vector for each point. What I want to know is that are
  the
   vectors output from the clustering algorithm stored as dense vector or
 is
   the clusterdumper  converting the vectors to dense vectors. If the
   clustering algorithm generates sparse vectors I can directly use them
 or
   else I will have to convert the vectors from dense to sparse for which
 I
   need the information mentioned in the above paragraph.
  
   Your suggestions on this are welcome.
  
   Thanks,
   Ashvini
  
 



Re: Clustering for customer segmentation

2013-08-12 Thread Ted Dunning
The tasks that you need to do include:

a) group your history by user id
b) extract the features you want to use from each user history
c) repeat clustering and adjusting the scaling of your features until you
are happy
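
A minimal sketch of steps a) and b) for purchase data like that described (the Transaction fields and the scale factors are placeholders; the scaling is exactly the knob that step c) keeps adjusting):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

// Group raw transactions by customer and turn each customer's history into a
// numeric feature vector that a clustering algorithm can consume.
public class CustomerFeatures {

  public static class Transaction {
    public final String customerId;
    public final double amount;
    public Transaction(String customerId, double amount) {
      this.customerId = customerId;
      this.amount = amount;
    }
  }

  public static Map<String, Vector> featurize(List<Transaction> transactions) {
    Map<String, List<Transaction>> byCustomer = new HashMap<String, List<Transaction>>();
    for (Transaction t : transactions) {                       // step a): group by customer
      List<Transaction> history = byCustomer.get(t.customerId);
      if (history == null) {
        history = new ArrayList<Transaction>();
        byCustomer.put(t.customerId, history);
      }
      history.add(t);
    }

    Map<String, Vector> features = new HashMap<String, Vector>();
    for (Map.Entry<String, List<Transaction>> e : byCustomer.entrySet()) {
      double volume = 0;
      for (Transaction t : e.getValue()) {
        volume += t.amount;
      }
      Vector v = new DenseVector(2);                           // step b): extract features
      v.set(0, Math.log1p(volume) / 10.0);                     // total sales volume, damped
      v.set(1, Math.log1p(e.getValue().size()));               // purchase frequency
      features.put(e.getKey(), v);
    }
    return features;
  }
}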

If you have a few hundred examples of customers broken down by the
segmentation that you want, then one thing that you might look at is this
paper:

http://www.cs.cmu.edu/~epxing/papers/Old_papers/xing_nips02_metric.pdf

It shows a method for learning a metric that optimizes clustering of
labeled and unlabeled points.

Mahout currently does not have support for this kind of metric learning,
but it would make an excellent addition.



On Sat, Aug 10, 2013 at 11:54 AM, Martin, Nick nimar...@pssd.com wrote:

 Hi all,

 I'm new to Mahout and wondering if anyone could point me in the right
 direction for doing customer purchase behavior clustering in Mahout. Seems
 most of what I encounter in online and book examples for clustering is
 text/document based.

 Basically, I'd like to be able to explore passing n years of customer
 transaction data into one of the clustering algorithms and have my customer
 population be segmented into similar groups. Key determinants of similarity
 would be things like sales volume, purchase frequency, sales channel,
 profitability, tenure, category mix, etc.

 Anywhere I can see examples of this kind of thing?

 Thanks!!
 Nick



 Sent from my iPhone


Re: Clustering for customer segmentation

2013-08-12 Thread Ted Dunning
On Mon, Aug 12, 2013 at 12:52 PM, Martin, Nick nimar...@pssd.com wrote:

 I'd love to contribute so I'll get on JIRA and sign up for the dev@ mailing
 list to start getting a feel for that process.


Sounds like you already know the drill.

Welcome!


Re: Setting up a recommender

2013-08-12 Thread Ted Dunning
Yes.  That would be interesting.




On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:

 A little digression: Might a Matrix implementation backed by a Solr index
 that uses SolrJ for querying help at all for the Solr recommendation
 approach?

 It supports multiple fields of String, Text, or boolean flags.

 Best
 Gokhan


 On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:

  Also a question about user history.
 
  I was planning to write these into separate directories so Solr could
  fetch them from different sources but it occurs to me that it would be
  better to join A and B by user ID and output a doc per user ID with three
  fields: id, A item history, and B item history. Other fields could be
  added for user metadata.
 
  Sound correct? This is what I'll do unless someone stops me.
 
  On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
  Once you have a sample or example of what you think the
  log file version will look like, can you post it? It would be great to
  have example lines for two actions with or without the same item IDs.
 I'll
  make sure we can digest it.
 
  I thought more about the ingest part and I don't think the one-item-space
  is actually a problem. It just means one item dictionary. A and B will
 have
  the right content, all I have to do is make sure the right ranks are
 input
  to the MM,
  Transpose, and RSJ. This in turn is only one extra count of the # of
 items
  in A's item space. This should be a very easy change if my thinking is
  correct.
 
 
  On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
   4) To add more metadata to the Solr output will be left to the consumer
   for now. If there is a good data set to use we can illustrate how to do
  it
   in the project. Ted may have some data for this from musicbrainz.
 
 
  I am working on this issue now.
 
  The current state is that I can bring in a bunch of track names and links
  to artist names and so on.  This would provide the basic set of items
  (artists, genres, tracks and tags).
 
  There is a hitch in bringing in the data needed to generate the logs
 since
  that part of MB is not Apache compatible.  I am working on that issue.
 
  Technically, the data is in a massively normalized relational form right
  now, but it isn't terribly hard to denormalize into a form that we need.
 
 
 



Re: Help regarding Seq2sparse utility

2013-08-11 Thread Ted Dunning
Aside from your issues with clusterdumper, the values you want can be had
from a sparse vector using v.iterateNonZero() and v.norm(0).

The issue with clusterdumper is odd.

Are you saying that the display shows all the components of the vector?  Or
that there is an in-memory representation that has been densified?



On Sun, Aug 11, 2013 at 9:24 PM, Ashwini P ashwini.a...@gmail.com wrote:

 Hello,

 I am new to Mahout. I want to know how I can get the list of features that
 were extracted from the corpus by seq2sparse and the count of the total
 number of features.

 My problem is that when I view the clustering output using clusterdumper I
 get only dense vectors for each point that belongs in the cluster, but I
 want the sparse vector for each point. What I want to know is whether the
 vectors output from the clustering algorithm are stored as dense vectors or
 whether clusterdumper is converting them to dense vectors. If the clustering
 algorithm generates sparse vectors I can use them directly; otherwise I will
 have to convert the vectors from dense to sparse, for which I need the
 information mentioned in the above paragraph.

 Your suggestions on this are welcome.

 Thanks,
 Ashvini



Re: Changing weightings in kmeans

2013-08-10 Thread Ted Dunning
Check out the streaming k-means code.

It provides capabilities for weighted samples.


On Sat, Aug 10, 2013 at 6:57 AM, William Moran echofo...@gmail.com wrote:

 Hi,

 How would I go about changing the weighting of certain words when preparing
 data for kmeans?

 Also, in clusterdumps I have already made, some of my clusters are marked
 'VL-' and some are 'CL-'. I believe this is to do with convergence, is it
 bad if the clusters have not converged and if so how can I ensure they do
 converge?

 Thanks

 (P.S. I did send a question similar to this a while ago but I'm not sure it
 worked)



Re: Setting new preferences on GenericBooleanPrefUserBasedRecommender

2013-08-09 Thread Ted Dunning
On Fri, Aug 9, 2013 at 12:30 PM, Matt Molek mpmo...@gmail.com wrote:

 From some local IR precision/recall testing, I've found that user based
 recommenders do better on my data, so I'd like to stick with user based if
 I can. I know precision/recall measures aren't always that important when
 dealing with recommendation, but in the case I'm using the recommender for,
 I think it's worth maximizing. I'm getting more than double the precision
 out of the user based recommenders.


What kind of user based recommender are you using?

Most competitive user based recommenders can be restated as item-based
recommenders.  Those are much easier to deploy.


Re: Is OnlineSummarizer mergeable?

2013-08-08 Thread Ted Dunning
I just looked at the source for QDigest from streamlib.

I think that the memory usage could be trimmed substantially, possibly by
as much as 5:1 by using more primitive friendly structures.



On Wed, Aug 7, 2013 at 3:04 PM, Otis Gospodnetic otis_gospodne...@yahoo.com
 wrote:

 Hi Ted,

 I need percentiles.  Ideally not pre-defined ones, because one person may
 want e.g. 70th pctile, while somebody else might want 75th pctile for the
 same metric.

 Deal breakers:
 High memory footprint. (high means higher than QDigest from stream-lib
 for us and we could test and compare with QDigest relatively easily
 with live data)
 Algos that create data structures that cannot be merged
 Loss of accuracy that is not predictably small or configurable

 Thank you,
 Otis
 

 Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
 http://sematext.com/spm




 
  From: Ted Dunning ted.dunn...@gmail.com
 To: user@mahout.apache.org user@mahout.apache.org; Otis Gospodnetic 
 otis_gospodne...@yahoo.com
 Sent: Wednesday, August 7, 2013 11:48 PM
 Subject: Re: Is OnlineSummarizer mergeable?
 
 
 
 Otis,
 
 
 What statistics do you need?
 
 
 What guarantees?
 
 
 
 
 
 On Wed, Aug 7, 2013 at 1:26 PM, Otis Gospodnetic 
 otis_gospodne...@yahoo.com wrote:
 
 Hi Ted,
 
 I'm actually trying to find an alternative to QDigest (the stream-lib
 impl specifically) because even though it seems good, we have to deal with
 crazy volumes of data in SPM (performance monitoring service, see
 signature)... I'm hoping we can find something that has both a lower memory
 footprint than QDigest AND that is mergeable a la QDigest.  Utopia?
 
 Thanks,
 Otis
 
 Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
 http://sematext.com/spm
 
 
 
 
 
  From: Ted Dunning ted.dunn...@gmail.com
 To: user@mahout.apache.org user@mahout.apache.org
 Sent: Wednesday, August 7, 2013 4:51 PM
 Subject: Re: Is OnlineSummarizer mergeable?
 
 
 It isn't as mergeable as I would like.  If you have randomized record
 selection, it should be possible, but perverse ordering can cause
 serious
 errors.
 
 It would be better to use something like a Q-digest.
 
 http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf
 
 
 
 
 On Wed, Aug 7, 2013 at 4:21 AM, Otis Gospodnetic 
 otis.gospodne...@gmail.com
  wrote:
 
  Hi,
 
  Is OnlineSummarizer algo mergeable?
 
  Say that we compute a percentile for some metric for time 12:00-12:01
  and store that somewhere, then we compute it for 12:01-12:02 and store
  that separately, and so on.
 
  Can we then later merge these computed and previously stored
  percentile instances and get an accurate value?
 
  Thanks,
  Otis
  --
  Performance Monitoring -- http://sematext.com/spm
  Solr  ElasticSearch Support -- http://sematext.com/
 
 
 
 
 
 
 


Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-08 Thread Ted Dunning
That might slow down the job enormously for certain nasty inputs.

The more that I think about things, the more convinced I am that there
should be a post-processing pass to enforce things like not recommending
input items.  The recommendation algorithm itself should not be distorted
to do this if it is unnatural (and forcing a user to not use sampling is a
great example ... there should be two controls here).

I think that the original point is also correct, however.  The user should
not be forced to implement this very common step.  As such I think that the
recommender code should still support doing this, but it really ought to be
as an output filter.
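
A minimal sketch of such an output filter, assuming the ranked recommendations
and the user's own interaction history are already in memory (plain Java, not
an existing Mahout class):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch: post-processing pass that drops recommended items the user already has.
public class KnownItemFilter {
  public static List<Long> filter(List<Long> rankedItemIds, Set<Long> alreadyPreferred, int topN) {
    List<Long> kept = new ArrayList<Long>();
    for (Long itemId : rankedItemIds) {
      if (!alreadyPreferred.contains(itemId)) {
        kept.add(itemId);
        if (kept.size() == topN) {
          break;
        }
      }
    }
    return kept;
  }
}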



On Wed, Aug 7, 2013 at 9:19 AM, Sebastian Schelter s...@apache.org wrote:

 if you also set --maxPrefsPerUserInItemSimilarity to a number higher than
 the max preferences per user, no sampling should occur. This might slow
 down the job however.

 2013/8/7 Rafal Lukawiecki ra...@projectbotticelli.com

  Is there a set of parameters which I could pass to RecommenderJob to
 avoid
  that random sampling, in order to create a test case for the issue I have
  experienced? Would setting --maxSimilaritiesPerItem and/or
  --maxPrefsPerUserInItemSimilarity help? Many thanks.
 
  On 7 Aug 2013, at 16:12, Sebastian Schelter ssc.o...@googlemail.com
   wrote:
 
  It could affect the results even in this case, as we also sample the
  preferences when computing similar items.
 
  On 07.08.2013 17:07, Rafal Lukawiecki wrote:
   Thank you, Sebastian. Would the random sampling affect the results of
  RecommenderJob, in any case? I am setting --maxPrefsPerUser to exceed the
  actual, maximum number of preferences expressed by every user.
  
   Rafal
  
   On 7 Aug 2013, at 15:48, Sebastian Schelter ssc.o...@googlemail.com
   wrote:
  
   The code in trunk allows to you to specify a randomSeed, the older
   versions don't unfortunately.
  
   On 07.08.2013 16:35, Rafal Lukawiecki wrote:
   Hi Sebastian,
  
   The quantity of returned duplicates is much too large to be caused
  just by sampling's randomness. I wonder if this could be related to
  something that is platform-specific, as in Windows vs. *nix
 representation
  of input files, data types etc.
  
   For argument's sake, is it possible to fix the seed of the random
  aspect of the sampling so I could feed the same input through two
 platforms
  and compare the results?
  
   Rafal
  
   On 7 Aug 2013, at 15:20, Sebastian Schelter ssc.o...@googlemail.com
   wrote:
  
   Hi Rafal,
  
   this sounds really strange, the bug should not have anything to do
 with
   the version of Hadoop that you are running. You could sometimes not
 see
   it due to the random sampling of the preferences.
  
   --sebastian
  
   On 07.08.2013 13:53, Rafal Lukawiecki wrote:
   Sebastian,
  
   I've been doing a little more digging regarding the issue of
  preferences being calculated for already preferred items. I re-run the
 jobs
  using the same data and the same parameters on a different installation
 of
  Hadoop, and the problem seems to have gone away. For now it looks like
 the
  issue arises when I run it under Mahout 0.7 and 0.8 using HDP
 (Hortonworks
  Data Platform) for Windows 1.1.0, with Hadoop 1.1.0. This problem does
 not
  show up, yet in my tests, under Hadoop 1.2.1 compiled for OS X. I will
 work
  a little more to ensure my results, but if they stood up, should I still
  report it as a Mahout issue?
  
   Rafal
   --
   Rafal Lukawiecki
   Strategic Consultant and Director
   Project Botticelli Ltd
  
   On 1 Aug 2013, at 17:31, Sebastian Schelter s...@apache.org wrote:
  
   Setting it to the maximum number should be enough. Would be great if
  you
   can share your dataset and tests.
  
   2013/8/1 Rafal Lukawiecki ra...@projectbotticelli.com
  
   Should I have set that parameter to a value much much larger than
 the
   maximum number of actually expressed preferences by a user?
  
   I'm working on an anonymised data set. If it works as an error test
  case,
   I'd be happy to share it for your re-test. I am still hoping it is
 my
   error, not Mahout's.
  
   Rafal
   --
   Rafal Lukawiecki
   Pardon brevity, mobile device.
  
   On 1 Aug 2013, at 17:19, Sebastian Schelter s...@apache.org
 wrote:
  
   Ok, please file a bug report detailing what you've tested and what
   results
   you got.
  
   Just to clarify, setting maxPrefsPerUser to a high number still
 does
  not
   help? That surprises me.
  
  
   2013/8/1 Rafal Lukawiecki ra...@projectbotticelli.com
  
   Hi Sebastian,
  
   I've rechecked the results, and, I'm afraid that the issue has not
  gone
   away, contrary to my yesterday's enthusiastic response. Using 0.8
 I
  have
   retested with and without --maxPrefsPerUser 9000 parameter (no
 user
  has
   more than 5000 prefs). I have also supplied the prefs file,
 without
  the
   preference value, that is as: user,item (one per line) as a
   --filterFile,
   with and without the -maxPrefsPerUser, and 

Re: How to get human-readable output for large clustering?

2013-08-08 Thread Ted Dunning
Mahout is a library.  You can link against any version you like and still
have a perfectly valid Hadoop program.




On Wed, Aug 7, 2013 at 11:51 AM, Adam Baron adam.j.ba...@gmail.com wrote:

 Suneel,

 Unfortunately no, we're still on Mahout 0.7.  My team is one of many teams
 which share a large, centrally administrated Hadoop cluster.  The admins
 are pretty strict about only installing official CDH releases.  I don't
 believe Mahout 0.8 is in an official CDH release yet.  Has the
 ClusterDumper code changed in 0.8?

 Regards,
   Adam

 On Tue, Aug 6, 2013 at 9:00 PM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:

  Adam,
 
  Pardon my asking again if this has already been answered - Are you
 running
  against Mahout 0.8?
 
 
 
 
--
   *From:* Adam Baron adam.j.ba...@gmail.com
  *To:* user@mahout.apache.org; Suneel Marthi suneel_mar...@yahoo.com
  *Sent:* Tuesday, August 6, 2013 6:56 PM
 
  *Subject:* Re: How to get human-readable output for large clustering?
 
  Suneel,
 
  I was trying -n 25 and -b 100 when I sent my e-mail about it not working
  for me.  Just tried -n 20 and got the same error message.  Any other
 ideas?
 
  Thanks,
   Adam
 
  On Mon, Aug 5, 2013 at 7:40 PM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:
 
  Adam/Florian,
 
  Could you try running the clusterdump by limiting the number of terms
 from
  clusterdump, by specifying -n 20 (outputs the 20 top terms)?
 
 
 
 
  
   From: Adam Baron adam.j.ba...@gmail.com
  To: user@mahout.apache.org
  Sent: Monday, August 5, 2013 8:03 PM
  Subject: Re: How to get human-readable output for large clustering?
 
 
  Florian,
 
  Any luck finding an answer over the past 5 months?  I'm also dealing with
  similar out of memory errors when I run clusterdump.  I'm using 50,000
  features and tried k=500.  The kmeans command ran fine, but then I got
  the dreaded OutOfMemory error with the clusterdump command:
 
  2013-08-05 18:46:01,686 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
      at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
      at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
      at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
      at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:118)
      at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2114)
      at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2242)
      at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
      at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
      at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
      at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
      at com.google.common.collect.Iterators$5.hasNext(Iterators.java:543)
      at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
      at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.getRepresentativePoints(RepresentativePointsMapper.java:103)
      at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.getRepresentativePoints(RepresentativePointsMapper.java:97)
      at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.setup(RepresentativePointsMapper.java:87)
      at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138)
      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:673)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
      at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
      at org.apache.hadoop.mapred.Child.main(Child.java:262)
 
  Thanks,
  Adam
 
  On Mon, Mar 11, 2013 at 8:42 AM, Florian Laws flor...@florianlaws.de
  wrote:
 
   Hi,
  
   I have 

Re: Regarding starting up our project

2013-08-08 Thread Ted Dunning
If you are doing a student project, it may be best for you to do this as a
separate github project that *depends* on Mahout rather than trying to
build a modification to Mahout in the first instance.

The reasons that I say this include:

a) the Apache process will probably be foreign to you at first and will
significantly slow you down as a result.

b) the enthusiasm for your code by the community will depend very much on
whether you can convince us that your code will be high quality and you
will be around to help maintain it.  Purely because this is a student
project, you will have a very hard time doing this.  That will also
slow down your progress.

c) the level of review for your code will be variable, but if you are able
to get reviews, they are likely to be more stringent than you are used to.
 This can be disheartening and, again, can slow you down.

d) the best route to guarantee the success for your school project is to
get something working well as soon as possible.  This implies that (a-c)
can seriously decrease your success rate.

Taking all of this together, what I suggest is that you start by developing
as a separate project.  This will let you get started instantly and make
progress immediately.  Being separate does not mean that you will lack
support from the Mahout community, you can still invite reviews and
commentary on your approach and your code.  All it means is that you won't
be slowed down by the whole community process and are more likely to have a
successful project.

If your project is successful and if your code fits into the Mahout style
and structure, then moving from a separate project into the mahout mainline
is relatively easy for a self-contained project like a neural network
implementation.

All of this said, you should look at the archives of the mailing list. Yexi
just recently put up some code to do much of what you suggest and you
should comment on the code review.  You should also decide how that code
affects your project.



On Wed, Aug 7, 2013 at 11:46 PM, Sushanth Bhat(MT2012147) 
sushanth.b...@iiitb.org wrote:

 Hi,

 We are planning to implement a neural network algorithm in Mahout. We are
 doing this as part of a Machine Learning course project. As we don't have
 much knowledge about Mahout, can anyone please help us get started with
 implementing the algorithm?

 Thanks and regards,
 Sushanth Bhat
 IIIT-Bangalore



Re: Regarding starting up our project

2013-08-08 Thread Ted Dunning
On Thu, Aug 8, 2013 at 1:31 PM, Sushanth Bhat(MT2012147) 
sushanth.b...@iiitb.org wrote:

 One more doubt I have: do we need to start our project without the Mahout
 library, i.e. just implement the algorithm on our own?


I would suggest that Mahout would be very useful for your project.

Use Maven and include Mahout math as a dependency.  If you do a map-reduce
implementation of neural nets, add Mahout core as well.
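
As a small illustration of the building blocks mahout-math provides for this
(a sketch only, not an existing Mahout neural-network API), a single
feed-forward layer is just a matrix-vector product followed by an
element-wise sigmoid:

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.DoubleFunction;

// Sketch: one feed-forward layer built from mahout-math primitives.
public class Layer {
  private final Matrix weights;  // rows = output units, columns = input units

  public Layer(Matrix weights) {
    this.weights = weights;
  }

  public Vector forward(Vector input) {
    Vector activation = weights.times(input);        // matrix-vector product
    return activation.assign(new DoubleFunction() {  // element-wise sigmoid
      public double apply(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
      }
    });
  }

  public static void main(String[] args) {
    Matrix w = new DenseMatrix(new double[][] {{0.1, -0.2, 0.4}, {0.0, 0.3, -0.1}});
    Vector out = new Layer(w).forward(new DenseVector(new double[] {1.0, 2.0, 3.0}));
    System.out.println(out);
  }
}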


Re: Is OnlineSummarizer mergeable?

2013-08-08 Thread Ted Dunning
I was about to point you at that pull request.  How droll.

Didn't know it was from you guys.


On Thu, Aug 8, 2013 at 3:35 PM, Otis Gospodnetic otis_gospodne...@yahoo.com
 wrote:

 Hi Ted,

 Yes, that's what we did recently, too:
 https://github.com/clearspring/stream-lib/pull/47

 ... but it's still a little too phat...which is what made me think of your
 OnlineSummarizer as a possible, slimmer alternative.

 Otis
 
 Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
 http://sematext.com/spm




 
  From: Ted Dunning ted.dunn...@gmail.com
 To: user@mahout.apache.org user@mahout.apache.org; Otis Gospodnetic 
 otis_gospodne...@yahoo.com
 Sent: Thursday, August 8, 2013 8:27 AM
 Subject: Re: Is OnlineSummarizer mergeable?
 
 
 
 I just looked at the source for QDigest from streamlib.
 
 
 I think that the memory usage could be trimmed substantially, possibly by
 as much as 5:1 by using more primitive friendly structures.
 
 
 
 
 
 On Wed, Aug 7, 2013 at 3:04 PM, Otis Gospodnetic 
 otis_gospodne...@yahoo.com wrote:
 
 Hi Ted,
 
 I need percentiles.  Ideally not pre-defined ones, because one person
 may want e.g. 70th pctile, while somebody else might want 75th pctile for
 the same metric.
 
 Deal breakers:
 High memory footprint. (high means higher than QDigest from
 stream-lib for us and we could test and compare with QDigest
 relatively easily with live data)
 Algos that create data structures that cannot be merged
 Loss of accuracy that is not predictably small or configurable
 
 Thank you,
 Otis
 
 
 Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
 http://sematext.com/spm
 
 
 
 
 
  From: Ted Dunning ted.dunn...@gmail.com
 To: user@mahout.apache.org user@mahout.apache.org; Otis
 Gospodnetic otis_gospodne...@yahoo.com
 Sent: Wednesday, August 7, 2013 11:48 PM
 Subject: Re: Is OnlineSummarizer mergeable?
 
 
 
 Otis,
 
 
 What statistics do you need?
 
 
 What guarantees?
 
 
 
 
 
 On Wed, Aug 7, 2013 at 1:26 PM, Otis Gospodnetic 
 otis_gospodne...@yahoo.com wrote:
 
 Hi Ted,
 
 I'm actually trying to find an alternative to QDigest (the stream-lib
 impl specifically) because even though it seems good, we have to deal with
 crazy volumes of data in SPM (performance monitoring service, see
 signature)... I'm hoping we can find something that has both a lower memory
 footprint than QDigest AND that is mergeable a la QDigest.  Utopia?
 
 Thanks,
 Otis
 
 Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
 http://sematext.com/spm
 
 
 
 
 
  From: Ted Dunning ted.dunn...@gmail.com
 To: user@mahout.apache.org user@mahout.apache.org
 Sent: Wednesday, August 7, 2013 4:51 PM
 Subject: Re: Is OnlineSummarizer mergeable?
 
 
 It isn't as mergeable as I would like.  If you have randomized record
 selection, it should be possible, but perverse ordering can cause
 serious
 errors.
 
 It would be better to use something like a Q-digest.
 
 http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf
 
 
 
 
 On Wed, Aug 7, 2013 at 4:21 AM, Otis Gospodnetic 
 otis.gospodne...@gmail.com
  wrote:
 
  Hi,
 
  Is OnlineSummarizer algo mergeable?
 
  Say that we compute a percentile for some metric for time
 12:00-12:01
  and store that somewhere, then we compute it for 12:01-12:02 and
 store
  that separately, and so on.
 
  Can we then later merge these computed and previously stored
  percentile instances and get an accurate value?
 
  Thanks,
  Otis
  --
  Performance Monitoring -- http://sematext.com/spm
  Solr  ElasticSearch Support -- http://sematext.com/
 
 
 
 
 
 
 
 
 
 


Re: Evaluating Precision and Recall of Various Similarity Metrics

2013-08-08 Thread Ted Dunning
Rafal,

The major problems with these sorts of metrics with recommendations include

a) different algorithms pull up different data and you don't have any
deeply scored reference data.  The problem is similar to search except
without test collections.  There are some partial solutions to this

b) recommendations are typically very strongly dependent on feedback from
data that they themselves sample.  This means, for instance, that a system
with dithering will often out-perform the same system without dithering.
 Dithering is a form of noise added to the result of a recommender so the
quality of the system with dithering logically has to be worse than the
system without.  The system with dithering performs much better, however,
because it is able to gather broader information and thus learns about
things that the version without dithering would never find.
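
One common way to implement dithering (a sketch under the assumption of
rank-based noise, not code from Mahout) is to re-sort the result list by
log(rank) plus Gaussian noise, so the head of the list barely moves while
deeper items occasionally surface:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Sketch: rank-based dithering.  Smaller epsilon leaves the list nearly unchanged;
// larger epsilon mixes deeper results up toward the top.
public class Dithering {

  private static final class Scored<T> {
    final T item;
    final double score;
    Scored(T item, double score) { this.item = item; this.score = score; }
  }

  public static <T> List<T> dither(List<T> ranked, double epsilon, Random rand) {
    List<Scored<T>> scored = new ArrayList<Scored<T>>(ranked.size());
    for (int rank = 0; rank < ranked.size(); rank++) {
      double noisy = Math.log(rank + 1.0) + epsilon * rand.nextGaussian();
      scored.add(new Scored<T>(ranked.get(rank), noisy));
    }
    Collections.sort(scored, new Comparator<Scored<T>>() {
      public int compare(Scored<T> a, Scored<T> b) { return Double.compare(a.score, b.score); }
    });
    List<T> result = new ArrayList<T>(ranked.size());
    for (Scored<T> s : scored) {
      result.add(s.item);
    }
    return result;
  }
}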

Problem (b) is the strongly limiting case because dithering can make a
bigger change than almost any reasonable algorithmic choice.  Sadly,
problem (a) is the one attacked in most academic research.




On Thu, Aug 8, 2013 at 10:34 AM, Rafal Lukawiecki 
ra...@projectbotticelli.com wrote:

 Hi Sebastian—thank you for your suggestions, incl considering other
 similarity measures like LoglikelihoodRation. I still hope to do a
 comparison of all of the available ones, under our data. I realise the
 importance (and also some limitations) of A/B in production testing, but
 having a broader way to test the recommender would have been useful.

 I suppose, I am used to looking at lift/profit charts, cross-validation,
 RMSE etc metrics of accuracy and reliability when working with data mining
 models, such as decision trees or clustering, but also using this technique
 for association rules evaluation, where I'd be hoping that the model
 correctly predicts basket completions. I am curious if there is anything
 along this line of thinking for evaluating recommenders that do not expose
 explicit models.

 Many thanks, very much indeed, for all your replies.

 Rafal

 On 8 Aug 2013, at 17:58, Sebastian Schelter s...@apache.org
  wrote:

 Hi Rafal,

 you are right, unfortunately there is no tooling available for doing
 holdout tests with RecommenderJob. It would be an awesome contribution to
 Mahout though.

 Ideally, you would want to split your dataset in a way that you retain some
 portion of the interactions of each user and then see how much of the
 held-out interactions you can reproduce. You should be aware that this is
 basically a test of how good a recommender can reproduce what already
 happened. If you get recommendations for items that are not in your held
 out data, this does not automatically mean that they are wrong. They might
 be very interesting things that the user simply hasn't had a chance to look
 at yet. The real performance of a recommender can only be found via
 extensive A/B testing in production systems.

 Btw, I would strongly recommend that you use a more sophisticated
 similarity than cooccurrence count, e..g LoglikelihoodRation.

 Best,
 Sebastian


 2013/8/8 Rafal Lukawiecki ra...@projectbotticelli.com

  I'd like to compare the accuracy, precision and recall of various vector
  similarity measures with regards to our data sets. Ideally, I'd like to
 do
  that for RecommenderJob, including CooccurrenceCount. However, I don't
  think RecommenderJob supports calculation of the performance metrics.
 
  Alternatively, I could use the evaluator logic in the non-Hadoop-based
  Item-based recommenders, but they do not seem to support the option of
  using CooccurrenceCount as a measure, or am I wrong?
 
  Reading archived conversations from here, I can see others have asked a
  similar question in 2011 (
  http://comments.gmane.org/gmane.comp.apache.mahout.user/9758) but there
  seems no clear guidance. Also, I am unsure if it is valid to split the
 data
  set into training/testing that way, as testing users' key characteristic
 is
  the items they have preferred—and there is no model to fit them to, so
 to
  speak, or they would become anonymous users if we stripped their
  preferences. Am I right in thinking that I could test RecommenderJob by
  feeding X random preferences of a user, having hidden the remainder of
  their preferences, and see if the hidden items/preferences would become
  their recommendations? However, that approach would change what a user
  likes (by hiding their preferences for testing purposes) and I'd be
  concerned about the value of the recommendation. Am I in a loop? Is
 there a
  way to somehow tap into the recommendation to get an accuracy metric out?
 
  Did anyone, perhaps, share a method or a script (R, Python, Java) for
  evaluating RecommenderJob results?
 
  Many thanks,
  Rafal Lukawiecki
 





Re: Arff files to Naive Bayes

2013-08-08 Thread Ted Dunning
On Wed, Aug 7, 2013 at 3:56 PM, John Meagher john.meag...@gmail.com wrote:

 Continuous values are being used now in addition to a large set of
 boolean flags.  I think I could convert the continuous values to some
 sort of bucketed values that could be represented as additional flags.
  If that was the case would the format need to be ...
 id1 flaga flagb
 id2 flagb flagc


Yes.
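
A minimal sketch of that bucketing, turning a continuous value into a token
that can sit next to the existing boolean flags (the bucket count and naming
scheme are illustrative):

// Sketch: map a continuous value into a fixed bucket token, e.g. "age_b3", so it can
// be fed to Naive Bayes as just another flag.  Assumes max > min.
public class Bucketizer {
  public static String bucketToken(String featureName, double value, double min, double max, int buckets) {
    double clamped = Math.max(min, Math.min(max, value));
    int bucket = (int) Math.min(buckets - 1, Math.floor((clamped - min) / (max - min) * buckets));
    return featureName + "_b" + bucket;
  }
}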


Re: Content-Based Recommendation Approaches

2013-08-07 Thread Ted Dunning
On Wed, Aug 7, 2013 at 7:29 AM, cont...@dhuebner.com wrote:

 This typically won't be fast enough if you have something like a random
 forest, but if your final targeting model is logistic regression, it
 probably will be fast enough.



 So usually I do need to train a custom model for each user independently?


Not necessarily.

Usually you need a global model that has user x item interaction variables.
 It isn't unusual to need a per user adjustment model, but if you can make
that rare, you can do better.

From the linear user x item interaction model, for instance, you may be
able to convert the model into a sparse weighted query that could retrieve
items from an inverted index such as Solr.  This might also be possible
with a per user model, but I would have to think about that.
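
A sketch of what that conversion might look like: keep only the positive
non-zero interaction weights and emit a boosted OR query string for a
Solr/Lucene index over item metadata (the field and term names are
illustrative, and negative weights would need separate handling):

import java.util.Map;

// Sketch: build a boosted query like "tag:rock^2.300 OR tag:indie^1.100" from the
// non-zero weights of a linear user x item interaction model.
public class SparseWeightedQuery {
  public static String toQuery(String field, Map<String, Double> termWeights) {
    StringBuilder q = new StringBuilder();
    for (Map.Entry<String, Double> e : termWeights.entrySet()) {
      if (e.getValue() <= 0.0) {
        continue;  // this sketch keeps only positive weights as boosts
      }
      if (q.length() > 0) {
        q.append(" OR ");
      }
      q.append(field).append(':').append(e.getKey())
       .append('^').append(String.format("%.3f", e.getValue()));
    }
    return q.toString();
  }
}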


Re: Is OnlineSummarizer mergeable?

2013-08-07 Thread Ted Dunning
It isn't as mergeable as I would like.  If you have randomized record
selection, it should be possible, but perverse ordering can cause serious
errors.

It would be better to use something like a Q-digest.

http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf




On Wed, Aug 7, 2013 at 4:21 AM, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:

 Hi,

 Is OnlineSummarizer algo mergeable?

 Say that we compute a percentile for some metric for time 12:00-12:01
  and store that somewhere, then we compute it for 12:01-12:02 and store
 that separately, and so on.

 Can we then later merge these computed and previously stored
 percentile instances and get an accurate value?

 Thanks,
 Otis
 --
 Performance Monitoring -- http://sematext.com/spm
 Solr  ElasticSearch Support -- http://sematext.com/



Re: Setting up a recommender

2013-08-07 Thread Ted Dunning
On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 4) To add more metadata to the Solr output will be left to the consumer
 for now. If there is a good data set to use we can illustrate how to do it
 in the project. Ted may have some data for this from musicbrainz.


I am working on this issue now.

The current state is that I can bring in a bunch of track names and links
to artist names and so on.  This would provide the basic set of items
(artists, genres, tracks and tags).

There is a hitch in bringing in the data needed to generate the logs since
that part of MB is not Apache compatible.  I am working on that issue.

Technically, the data is in a massively normalized relational form right
now, but it isn't terribly hard to denormalize into a form that we need.


Re: Is OnlineSummarizer mergeable?

2013-08-07 Thread Ted Dunning
Otis,

What statistics do you need?

What guarantees?



On Wed, Aug 7, 2013 at 1:26 PM, Otis Gospodnetic otis_gospodne...@yahoo.com
 wrote:

 Hi Ted,

 I'm actually trying to find an alternative to QDigest (the stream-lib impl
 specifically) because even though it seems good, we have to deal with crazy
 volumes of data in SPM (performance monitoring service, see signature)...
 I'm hoping we can find something that has both a lower memory footprint
 than QDigest AND that is mergeable a la QDigest.  Utopia?

 Thanks,
 Otis
 
 Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
 http://sematext.com/spm




 
  From: Ted Dunning ted.dunn...@gmail.com
 To: user@mahout.apache.org user@mahout.apache.org
 Sent: Wednesday, August 7, 2013 4:51 PM
 Subject: Re: Is OnlineSummarizer mergeable?
 
 
 It isn't as mergeable as I would like.  If you have randomized record
 selection, it should be possible, but perverse ordering can cause serious
 errors.
 
 It would be better to use something like a Q-digest.
 
 http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf
 
 
 
 
 On Wed, Aug 7, 2013 at 4:21 AM, Otis Gospodnetic 
 otis.gospodne...@gmail.com
  wrote:
 
  Hi,
 
  Is OnlineSummarizer algo mergeable?
 
  Say that we compute a percentile for some metric for time 12:00-12:01
   and store that somewhere, then we compute it for 12:01-12:02 and store
  that separately, and so on.
 
  Can we then later merge these computed and previously stored
  percentile instances and get an accurate value?
 
  Thanks,
  Otis
  --
  Performance Monitoring -- http://sematext.com/spm
  Solr  ElasticSearch Support -- http://sematext.com/
 
 
 
 


Re: up-to-date book or tutorial

2013-08-07 Thread Ted Dunning
There is a considerable amount of discussion going on about a new edition
of Mahout in Action.


On Wed, Aug 7, 2013 at 12:36 PM, Piero Giacomelli pgiac...@gmail.comwrote:

 Basically all my examples will be based on Mahout 0.8. So, for example, the
 k-means clustering will be used with the updated version. I think that by
 the end of August the preorder will be available.
  Il giorno 07/ago/2013 21:23, Suneel Marthi suneel_mar...@yahoo.com ha
 scritto:

  Congrats on the book, Piero.
 
  Would this book be based on Mahout 0.8 (and exclude stuff that has been
  marked as deprecated in 0.8)
 
 
 
 
  
   From: Piero Giacomelli pgiac...@gmail.com
  To: user@mahout.apache.org
  Sent: Wednesday, August 7, 2013 3:18 PM
  Subject: Re: up-to-date book or tutorial
 
 
 Packt will publish a cookbook on Mahout in a couple of months
  Il giorno 06/ago/2013 10:53, Prasad, Girijesh g.pra...@ulster.ac.uk
 ha
  scritto:
 
   I am looking for an up-to-date book or tutorial. Is the Mahout in
 Action
   http://www.manning.com/owen/ the only best option? Earlier I saw a
   promotion code but I am unable to find any more. Please advise.
  
   With best wishes,
   Girijesh.
   -
  
  
  
   
  
   This email and any attachments are confidential and intended solely for
   the use of the addressee and may contain information which is covered
 by
   legal, professional or other privilege. If you have received this email
  in
   error please notify the system manager at postmas...@ulster.ac.uk and
   delete this email immediately. Any views or opinions expressed are
 solely
   those of the author and do not necessarily represent those of the
   University of Ulster. The University's computer systems may be
 monitored
   and communications carried out on them may be recorded to secure the
   effective operation of the system and for other lawful purposes. The
   University of Ulster does not guarantee that this email or any
  attachments
   are free from viruses or 100% secure. Unless expressly stated in the
 body
   of a separate attachment, the text of email is not intended to form a
   binding contract. Correspondence to and from the University may be
  subject
   to requests for disclosure by 3rd parties under relevant legislation.
 The
   University of Ulster was founded by Royal Charter in 1984 and is
  registered
   with company number RC000726 and VAT registered number GB672390524.The
   primary contact address for the University of Ulster in Northern
 Ireland
   is,Cromore Road, Coleraine, Co. Londonderry BT52 1SA
  



Re: Arff files to Naive Bayes

2013-08-07 Thread Ted Dunning
By non-text, do you mean continuous values?   Or sparse sets of tokens?

The general idea for Naive Bayes is that it requires input consisting of
sparse sets of tokens.



On Wed, Aug 7, 2013 at 2:00 PM, John Meagher john.meag...@gmail.com wrote:

 I'm just starting work with Mahout and I'm struggling getting an
 example of a non-text based Naive Bayes classifier up and running.
 The input will be feature vectors generated outside of Mahout.  As a
 test I'm using arff files (anything else CSV-ish will work).  I've
 been able to convert things into vectors in a few different ways, but
 can't figure out what is needed to get the trainnb command to work.

 Does the label index need to be generated through some manual process
 or something other than the arff.vector or trainnb command?

 Is there a specific format needed for the input arff files?  Specific
 columns in a specific order?


 Here's what I've tried so far in both 0.7 from CDH4 and 0.8 direct from
 Apache:

 $ wget http://repository.seasr.org/Datasets/UCI/arff/iris.arff
 $ mahout arff.vector --input iris.arff --output iris.model --dictOut
 iris.labels

 This works and seems to be right so far

 This is the command I think I need to train the Naive Bayes model.  It
 fails when creating the label index with the exception below.

 $ mahout trainnb -i iris.model/ -o iris.training -el -li
 iris.training.labels
 ...
 Exception in thread main java.lang.ArrayIndexOutOfBoundsException: 1
 at
 org.apache.mahout.classifier.naivebayes.BayesUtils.writeLabelIndex(BayesUtils.java:123)
 at
 org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.createLabelIndex(TrainNaiveBayesJob.java:180)
 at
 org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:94)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 ...


 Thanks for the help,
 John



Re: Is OnlineSummarizer mergeable?

2013-08-07 Thread Ted Dunning
Ouch.

You didn't mention accuracy.  I will assume a standard sort of 2-3%
accuracy or better and let you correct me if necessary.

I could meet all but one or two of those requirements several different
ways.

For instance, very high or low quantiles can be met with stacked min-sets
or max-sets.  The idea is that you keep the highest k values, the highest k
values of a 10x downsampled copy of the data, and so on.  This is pretty good
for down to
the 90+%-ile (or up to the 10th %-ile).  This structure merges without loss
of accuracy.
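
A rough sketch of one such stacked max-set under the assumptions above: level
i sees an (approximately) 10^-i sample of the stream and keeps only its top k
values, and merging just merges the corresponding levels and trims back to k
(illustrative code, not an existing Mahout class):

import java.util.PriorityQueue;
import java.util.Random;

// Sketch: stacked max-sets for estimating very high quantiles.  The smallest value
// retained at level i approximates the order statistic roughly k * 10^i from the top.
public class StackedMaxSets {
  private final int k;
  private final PriorityQueue<Double>[] levels;  // min-heaps holding the current top k
  private final Random rand = new Random();

  @SuppressWarnings("unchecked")
  public StackedMaxSets(int k, int numLevels) {
    this.k = k;
    this.levels = new PriorityQueue[numLevels];
    for (int i = 0; i < numLevels; i++) {
      levels[i] = new PriorityQueue<Double>(k);
    }
  }

  public void add(double x) {
    for (PriorityQueue<Double> level : levels) {
      level.offer(x);
      if (level.size() > k) {
        level.poll();                            // drop the smallest, keep the top k
      }
      if (rand.nextDouble() > 0.1) {
        break;                                   // each deeper level sees a further 10x downsample
      }
    }
  }

  public void merge(StackedMaxSets other) {
    for (int i = 0; i < levels.length; i++) {
      levels[i].addAll(other.levels[i]);
      while (levels[i].size() > k) {
        levels[i].poll();
      }
    }
  }
}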

For well-defined quantiles like the 25th, 50th and 75th percentiles, the Mahout OnlineSummarizer
is excellent.  You can choose your arbitrary quantile ahead of time and you
can sometimes merge (but perverse data can kill you).
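
For reference, a minimal use of that class, assuming the add(), getMedian()
and getQuartile() methods of org.apache.mahout.math.stats.OnlineSummarizer
(the data here is made up):

import java.util.Random;
import org.apache.mahout.math.stats.OnlineSummarizer;

// Sketch: feed samples to OnlineSummarizer and read back the quartiles.
public class SummarizerExample {
  public static void main(String[] args) {
    OnlineSummarizer summarizer = new OnlineSummarizer();
    Random rand = new Random(42);
    for (int i = 0; i < 100000; i++) {
      summarizer.add(rand.nextGaussian());
    }
    System.out.println("25th percentile: " + summarizer.getQuartile(1));
    System.out.println("median:          " + summarizer.getMedian());
    System.out.println("75th percentile: " + summarizer.getQuartile(3));
  }
}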

And then the QDigest.  It is, by definition, as big as a QDigest, but is
mergeable and allows any quantile. Also cool is the fact that you can pick
the quantile late in the process.

Maybe the answer is to make the QDigest structure smaller.  How well is the
streamlib implementation cranked down?  Is it really tight?




On Wed, Aug 7, 2013 at 3:04 PM, Otis Gospodnetic otis_gospodne...@yahoo.com
 wrote:

 Hi Ted,

 I need percentiles.  Ideally not pre-defined ones, because one person may
 want e.g. 70th pctile, while somebody else might want 75th pctile for the
 same metric.

 Deal breakers:
 High memory footprint. (high means higher than QDigest from stream-lib
 for us and we could test and compare with QDigest relatively easily
 with live data)
 Algos that create data structures that cannot be merged
 Loss of accuracy that is not predictably small or configurable

 Thank you,
 Otis
 

 Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
 http://sematext.com/spm




 
  From: Ted Dunning ted.dunn...@gmail.com
 To: user@mahout.apache.org user@mahout.apache.org; Otis Gospodnetic 
 otis_gospodne...@yahoo.com
 Sent: Wednesday, August 7, 2013 11:48 PM
 Subject: Re: Is OnlineSummarizer mergeable?
 
 
 
 Otis,
 
 
 What statistics do you need?
 
 
 What guarantees?
 
 
 
 
 
 On Wed, Aug 7, 2013 at 1:26 PM, Otis Gospodnetic 
 otis_gospodne...@yahoo.com wrote:
 
 Hi Ted,
 
 I'm actually trying to find an alternative to QDigest (the stream-lib
 impl specifically) because even though it seems good, we have to deal with
 crazy volumes of data in SPM (performance monitoring service, see
 signature)... I'm hoping we can find something that has both a lower memory
 footprint than QDigest AND that is mergeable a la QDigest.  Utopia?
 
 Thanks,
 Otis
 
 Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
 http://sematext.com/spm
 
 
 
 
 
  From: Ted Dunning ted.dunn...@gmail.com
 To: user@mahout.apache.org user@mahout.apache.org
 Sent: Wednesday, August 7, 2013 4:51 PM
 Subject: Re: Is OnlineSummarizer mergeable?
 
 
 It isn't as mergeable as I would like.  If you have randomized record
 selection, it should be possible, but perverse ordering can cause
 serious
 errors.
 
 It would be better to use something like a Q-digest.
 
 http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf
 
 
 
 
 On Wed, Aug 7, 2013 at 4:21 AM, Otis Gospodnetic 
 otis.gospodne...@gmail.com
  wrote:
 
  Hi,
 
  Is OnlineSummarizer algo mergeable?
 
  Say that we compute a percentile for some metric for time 12:00-12:01
   and store that somewhere, then we compute it for 12:01-12:02 and store
  that separately, and so on.
 
  Can we then later merge these computed and previously stored
  percentile instances and get an accurate value?
 
  Thanks,
  Otis
  --
  Performance Monitoring -- http://sematext.com/spm
  Solr  ElasticSearch Support -- http://sematext.com/
 
 
 
 
 
 
 


Re: Content-Based Recommendation Approaches

2013-08-06 Thread Ted Dunning
On Tue, Aug 6, 2013 at 5:27 PM, Dominik Hübner cont...@dhuebner.com wrote:

 I wonder how model based approaches might be scaled to a large number of
 users. My understanding is that I would have to train some model like a
 decision tree or naive bayes (or regression … etc.)  for each user and do
 the prediction for each item using this model.

 Is there any common approach to get those techniques scaling up with
 larger datasets?


Yes.  There are several approaches.

One of the most effective is rescoring.  You use a performant recommender
such as a search engine based recommender and then rescore the top few
hundred items using a more detailed model.

This typically won't be fast enough if you have something like a random
forest, but if your final targeting model is logistic regression, it
probably will be fast enough.

In any case, there are also tricks you can pull in the evaluation of
certain classes of models.  For instance, with logistic regression, you can
remove the link function (doesn't change ordering) and you can ignore all
user specific features and weights (this doesn't change ordering either).
 This leaves you with a relatively small number of computations in the form
of a sparse by dense dot product.
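
A minimal sketch of that final scoring step, assuming the candidate item's
non-zero features live in a Mahout sparse vector and the remaining model
weights in a dense one (dropping the logistic link is safe because it is
monotone, so the ranking does not change):

import java.util.Iterator;
import org.apache.mahout.math.Vector;

// Sketch: score a candidate item with a linear model via a sparse-by-dense dot product.
// Only the item's non-zero features are touched, so rescoring a few hundred candidates is cheap.
public class LinearRescorer {
  public static double score(Vector sparseItemFeatures, Vector denseWeights) {
    double sum = 0.0;
    for (Iterator<Vector.Element> it = sparseItemFeatures.iterateNonZero(); it.hasNext();) {
      Vector.Element e = it.next();
      sum += e.get() * denseWeights.get(e.index());
    }
    return sum;  // no link function applied; ordering is the same either way
  }
}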


Re: solr-recommender, recent changes to ToItemVectorsMapper

2013-08-05 Thread Ted Dunning
Concur here.  Obviously CrossRowSimilarityJob and RowSimilarityJob will be
able to share some down-stream code.  But there are economies in RSJ that
probably can't apply to CRSJ.



On Mon, Aug 5, 2013 at 7:20 AM, Sebastian Schelter s...@apache.org wrote:

 I think the downsampling belongs into RowSimilarityJob. But I also think
 that we need a special CrossRowSimilarityJob that computes B'A
 and also downsamples them during the computation. Furthermore it should
 compute LLR similarities between the rows not dot products.

 --sebastian

 On 05.08.2013 16:14, Pat Ferrel wrote:
  OK, iI see it in my build now. Also not sufficient repos in the pom.
 
  Looks like some major refactoring of RowSimilarity is in progress.
 
  Sebastian, are you sure downsampling belongs in RowSimilairty? It won't
 be applied to [B'A]?
 
  If so I'll update to the lastest Mahout trunk.
 
  On Aug 4, 2013, at 8:57 PM, B Lyon bradfl...@gmail.com wrote:
 
  Hi Pat
 
  Below is the compilation error - it's what led me to look at the
 SAMPLE_SIZE stuff in the first place, where I confirmed via javap that the
 downloaded mahout jar did not have it any more and then I started looking
 at the svn source.  Mebbe I've got something else misconfigured somehow,
 although I don't see how it would compile if it's looking for that static
 field that's removed.
 
  [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile (default-compile) on project solr-recommender: Compilation failure: Compilation failure:
  [ERROR] /Users/bradflyon/Documents/solr-recommender/src/main/java/finderbots/recommenders/hadoop/PrepareActionMatrixesJob.java:[120,71] cannot find symbol
  [ERROR] symbol  : variable SAMPLE_SIZE
  [ERROR] location: class org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper
  [ERROR] /Users/bradflyon/Documents/solr-recommender/src/main/java/finderbots/recommenders/hadoop/PrepareActionMatrixesJob.java:[168,71] cannot find symbol
  [ERROR] symbol  : variable SAMPLE_SIZE
  [ERROR] location: class org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper
  [
 
 
  On Sun, Aug 4, 2013 at 8:57 PM, Pat Ferrel pat.fer...@gmail.com wrote:
  Just updated to today's Mahout trunk and everything works for me.
 
  Can you send me the error?
 
  Sebastian, do we really want this limit in RowSimilairty? It will not be
 applied to [B'A] unless you also do a mod to give us RowSimilairty on two
 matrices. Now that would be very nice indeed…
 
  On Aug 3, 2013, at 9:48 PM, B Lyon bradfl...@gmail.com wrote:
 
  Hi Pat
 
  I was going to just play with building the solr-recommender stuff in its
 current wip state and noticed a compile error (running mvn install), I think
 because the 0.9 snapshot has some changes from July 30th
 
  http://svn.apache.org/viewvc?view=revisionrevision=1508302
 
  Basically, back on June 18, Ted noticed that the downsampling might not
 be being done at the right place to actually avoid overwork due to
 perversely prolific users (thread is here:
 http://web.archiveorange.com/archive/v/z6zxQatCzHoFxbdLF0of), and someone
 else (Sebastian Schelter) has already acted on this (July 30) to move the
 downsampling to somewhere else (Mahout-1289 -
 https://issues.apache.org/jira/browse/MAHOUT-1289), which (among other
 things) removes the SAMPLE_SIZE static variable from ToItemVectorsMapper.
  I don't know how the general changes affect what you were setting
 up/playing with.  Let me know if I've missed something here.
 
 
 



