Re: "LLR with time"

2017-11-14 Thread Johannes Schulte
d then recommending items in that category all based on user behavior. Or try a placement based on a single thing a user watched like “because you watched xyz you might like these”. Don’t just show the most popular categories for the user and recommend items

Re: "LLR with time"

2017-11-12 Thread Johannes Schulte
; was better to have specialized pages for what's new and hot rather than because I had data saying it was bad to do. I have put a very weak recommendation effect on the what's hot pages so that people tend to see trending material that they like. That doesn't help on what's new pa

Re: "LLR with time"

2017-11-11 Thread Johannes Schulte
nt to hear White Christmas <https://www.youtube.com/watch?v=P8Ozdqzjigg> until the day after Christmas, at which point this becomes a really bad recommendation. To some degree, this can be partially dealt with by using temporal tags as indicators,

Re: "LLR with time"

2017-11-11 Thread Johannes Schulte
ormal recommendations—so you can ask for hot in “electronics” if you know categories, or hot "in-stock" items, or ... Still anomaly detection does sound like an interesting approach. On Nov 10, 2017, at 3:13 PM, Johannes Schulte <johannes.schu...@gmail.com>

"LLR with time"

2017-11-10 Thread Johannes Schulte
Hi "all", I am wondering what would be the best way to incorporate event time information into the calculation of the G-Test. There is a claim here https://de.slideshare.net/tdunning/finding-changes-in-real-data saying "Time aware variant of G-Test is possible". I remember I experimented with

Re: Text clustering with hashing vector encoders

2014-03-21 Thread Johannes Schulte
good. The step for finding labels is still unclear to me. You use the Loglikelihood class on the original documents? How? Or do you mean the collocation job? Cheers, Frank On Thu, Mar 20, 2014 at 8:39 PM, Johannes Schulte johannes.schu...@gmail.com wrote: Hi Frank, we are using

Re: Text clustering with hashing vector encoders

2014-03-20 Thread Johannes Schulte
Hi Frank, we are using a very similar system in production. Hashing text-like data to a 5 dimensional vector with two probes, and then applying tf-idf weighting. For IDF we don't keep a separate weight dictionary but just count the distinct training examples (documents) that have a non-null

Re: OutOfMemoryError: Java Heap Space in DocumentProcessor.tokenizeDocuments

2014-02-22 Thread Johannes Schulte
I would pass the memory parameters in the args array directly. The hadoop specific arguments must come before your custom arguments, so like this: String[] args = new String[]{"-Dmapreduce.map.memory.mb=12323", "customOpt1"}; ToolRunner.run(..args) The tool runner takes care of putting the hadoop

Re: SGD classifier demo app

2014-02-03 Thread Johannes Schulte
Hi Frank, you are using the feature vector encoders, which hash a combination of feature name and feature value to 2 (default) locations in the vector. The vector size you configured is 11, and this is imo very small compared to the number of possible combinations of values you have for your data (education, marital,
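The collision argument above can be made concrete with a small stand-alone sketch. It is not Mahout's encoder implementation: the class `HashedEncoderSketch`, the method `encode`, and the hashing scheme (Java's `Objects.hash` over name, value and probe index) are assumptions chosen for illustration, and every feature name other than education and marital is made up.

```java
import java.util.Objects;

/** Illustrative only: a simplified stand-in for hashed feature encoding, not Mahout's implementation. */
final class HashedEncoderSketch {

    /** Adds weight 1.0 at `probes` hashed locations for each (name, value) pair. */
    static double[] encode(String[][] features, int vectorSize, int probes) {
        double[] vector = new double[vectorSize];
        for (String[] feature : features) {
            for (int probe = 0; probe < probes; probe++) {
                // Hypothetical hashing scheme: name, value and probe index hashed together.
                int hash = Objects.hash(feature[0], feature[1], probe);
                vector[Math.floorMod(hash, vectorSize)] += 1.0;
            }
        }
        return vector;
    }

    /** Number of vector slots that received at least one update. */
    static int nonZero(double[] vector) {
        int count = 0;
        for (double v : vector) {
            if (v != 0.0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up example record; only "education" and "marital" appear in the thread.
        String[][] example = {
            {"education", "university"}, {"marital", "married"},
            {"job", "technician"}, {"housing", "yes"},
            {"loan", "no"}, {"contact", "cellular"},
            {"month", "may"}, {"outcome", "unknown"}
        };
        // 8 features x 2 probes = 16 updates, so an 11-slot vector must have collisions.
        System.out.println("size 11:   " + nonZero(encode(example, 11, 2)) + " slots used");
        System.out.println("size 1000: " + nonZero(encode(example, 1000, 2)) + " slots used");
    }
}
```

With 16 updates going into 11 slots, several distinct feature values necessarily share a slot, and the model cannot tell them apart. A larger vector makes that much less likely.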

Re: Item recommendation w/o users or preferences

2014-01-13 Thread Johannes Schulte
Hey, since you are already using basket analysis terms like support, confidence and lift, it might be easier for you to think of the LLR score as a better lift, since it automatically puts a penalty on rare items (you usually use support in classic MBA for that). So, you would use the same 4
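As a rough illustration of "LLR as a better lift", the sketch below computes both scores from a 2x2 co-occurrence table. It assumes the usual four contingency counts (k11: both items, k12: only A, k21: only B, k22: neither) and uses the standard entropy formulation of the G-test. The class and method names are hypothetical, and this is not a copy of Mahout's own code.

```java
/** Illustrative only: LLR (G-test) and lift for a 2x2 co-occurrence table. */
final class LlrSketch {

    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized entropy of a set of counts: N*ln(N) - sum(k*ln(k)). */
    private static double entropy(long... counts) {
        long total = 0;
        double sumXLogX = 0.0;
        for (long k : counts) {
            total += k;
            sumXLogX += xLogX(k);
        }
        return xLogX(total) - sumXLogX;
    }

    /** k11: A and B together, k12: A without B, k21: B without A, k22: neither. */
    static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Guard against tiny negative values caused by floating point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    /** Classic lift: P(A and B) / (P(A) * P(B)). */
    static double lift(long k11, long k12, long k21, long k22) {
        double n = k11 + k12 + k21 + k22;
        return (k11 * n) / ((double) (k11 + k12) * (k11 + k21));
    }

    public static void main(String[] args) {
        // A pair seen together exactly once, and never apart (made-up counts).
        System.out.println("rare pair:      lift=" + lift(1, 0, 0, 9999)
                + " llr=" + llr(1, 0, 0, 9999));
        // A pair with far more supporting evidence (made-up counts).
        System.out.println("supported pair: lift=" + lift(100, 100, 100, 9700)
                + " llr=" + llr(100, 100, 100, 9700));
    }
}
```

On these made-up counts, lift ranks the pair seen only once far above the well-supported pair, while LLR ranks them the other way round. That is the built-in penalty on rare items described above.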

Re: Setting up a recommender

2013-08-05 Thread Johannes Schulte
We have had a cross recommender in production for about 3 months now, with the difference that we use Lucene to build indices from map reduce directly, plus we do the same thing for 30+ customers, most of them with different input data structures (field names, values). We had something similar before

Re: Keeping track of revisions of models?

2013-07-18 Thread Johannes Schulte
Hi, we are just keeping them in HDFS, one directory with timestamp per model and a meta file gathering some metrics like AUC, number of training examples, class distribution. This makes it easy to generate reports out of it on the fly, whereas this would be very hard with git (plus there is no added

FeatureVectorEncoder Framework Signatures

2013-05-28 Thread Johannes Schulte
Hi, right now the only way to use the encoders without Strings is with a byte array. Wouldn't it be helpful to allow passing in an offset and length for use cases where there's a reusable byte array at hand? There's a part of MIA devoted to speeding up the encoding and I think this would be a

Re: Which database should I use with Mahout

2013-05-21 Thread Johannes Schulte
Dunning ted.dunn...@gmail.com wrote: Johannes, Your summary is good. I would add that the precalculated recommendations can be large enough that the lookup becomes more expensive. Your point about staleness is very on-point. On Mon, May 20, 2013 at 10:15 PM, Johannes Schulte

Re: Which database should I use with Mahout

2013-05-21 Thread Johannes Schulte
in a real business, you are very lucky. The search engine approach handles (b) and (c) by nature which massively improves the likelihood of ever getting to examine (d). On Tue, May 21, 2013 at 1:13 AM, Johannes Schulte johannes.schu...@gmail.com wrote: Thanks! Could you also add how to learn

Re: Which database should I use with Mahout

2013-05-20 Thread Johannes Schulte
I think Pat is just saying that time(history_lookup) (1) + time(recommendation_calculation) (2) > time(precalc_lookup) (3), since 1 and 3 are assumed to be served by the same system class (key value store, db) with a single key, and 2 > 0. Ted is using a lot of information that is available at

Re: Clustering product views and sales

2013-05-06 Thread Johannes Schulte
Hi! As a starting point I remember this conversation containing both elements (although the reconstruction part is rather small, hint!) http://markmail.org/message/5cfewal3oyt6vw2k On Tue, May 7, 2013 at 1:00 AM, Dominik Hübner cont...@dhuebner.com wrote: One more thing for now @Ted: What do

Re: Is Feature Hashing appropriate for document to document similarity calculations?

2013-04-24 Thread Johannes Schulte
Hi Martin, I guess you should be fine with the StaticWordValueEncoder; see e.g. this discussion on this list. It is about clustering but matches some of your questions

Re: In-memory kmeans clustering

2013-04-09 Thread Johannes Schulte
Hi, this worked for me without having to fiddle with map reduce classes: List<Cluster> initialClusters = new ArrayList<Cluster>(); Iterable<Vector> dataPoints = Lists.newArrayList(); ClusterClassifier prior = new ClusterClassifier(initialClusters,

Re: In-memory kmeans clustering

2013-04-09 Thread Johannes Schulte
dataPoints can be in memory or from disk, and you can sample the dataPoints for initialClusters. On Tue, Apr 9, 2013 at 6:16 PM, Johannes Schulte johannes.schu...@gmail.com wrote: Hi, this worked for me without having to fiddle with map reduce classes: List<Cluster> initialClusters = new

Re: Naive Bayes Classifier - Scores

2013-02-25 Thread Johannes Schulte
Hi, the score is the probability of the example belonging to the class but under independence assumptions and hence only useful to compare scores of different classes with each other (..more likely than..). Since it is meant to be a probability, it can range from 0 to 1. If you want to transform

Re: Implicit preferences

2013-02-11 Thread Johannes Schulte
the performance? Thanks for all the input! On Mon, Feb 11, 2013 at 7:20 AM, Ted Dunning ted.dunn...@gmail.com wrote: On Sun, Feb 10, 2013 at 3:39 PM, Johannes Schulte johannes.schu...@gmail.com wrote: ... i am currently implementing a system of the same kind, LLR sparsified term-cooccurrence

Re: Implicit preferences

2013-02-11 Thread Johannes Schulte
:27 PM, Ken Krugler kkrugler_li...@transpac.com wrote: On Feb 11, 2013, at 1:57am, Johannes Schulte wrote: @Ken Thanks for the hints... I am coming from a payload based system so I am aware of them, however in the lucene 3.6 branch boosting and payloads didn't work together (if you set

Re: Implicit preferences

2013-02-10 Thread Johannes Schulte
Hi, I am currently implementing a system of the same kind, LLR sparsified term-cooccurrence vectors in Lucene (since not a day goes by where I don't see Ted praising this). There are not only views and purchases, but also search terms, facets and a lot more textual information to be included in the

Re: Click probability prediction using Mahout. From model output to probability

2012-12-27 Thread Johannes Schulte
Hi Pavel, first of all i would include an intercept term in the model. This learns the proportion of examples in the training set. Second, for getting calibrated probabilities out of the downsampled model, I can think of two ways: 1. Use another set of input data to measure the observed maximum
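For reference, one commonly used correction for a model trained on downsampled data is sketched below. It is not necessarily either of the two ways the message goes on to describe. It assumes that only the negative (non-click) examples were downsampled, at a known keep rate, and that all positives were kept. The class and method names are hypothetical.

```java
/** Illustrative only: maps a score from a model trained on downsampled negatives back to the original scale. */
final class DownsamplingCorrectionSketch {

    /**
     * @param p                probability predicted by the model trained on downsampled data
     * @param negativeKeepRate fraction of negative examples kept during training, in (0, 1]
     * @return estimated probability under the original class distribution
     */
    static double correct(double p, double negativeKeepRate) {
        if (negativeKeepRate <= 0.0 || negativeKeepRate > 1.0) {
            throw new IllegalArgumentException("negativeKeepRate must be in (0, 1]");
        }
        // Downsampling negatives by rate r multiplies the odds by 1/r,
        // so the true odds are the predicted odds times r.
        return p / (p + (1.0 - p) / negativeKeepRate);
    }

    public static void main(String[] args) {
        // With only 10% of negatives kept, a predicted 0.5 corresponds to about 0.09.
        System.out.println(correct(0.5, 0.1));
    }
}
```

The correction only rescales the odds, so it keeps the ranking of examples unchanged. Whether the result is well calibrated still has to be checked against held-out data with the original class distribution.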

Re: Click probability prediction using Mahout. From model output to probability

2012-12-27 Thread Johannes Schulte
Oops, hit enter too early... Just wanted to say that those are the two ways I'm thinking of right now since I have a similar challenge. I'm thankful for any suggestions or comments. Cheers, Johannes On Thu, Dec 27, 2012 at 3:13 PM, Johannes Schulte johannes.schu...@gmail.com wrote: Hi Pavel

Re: Clustering without hadoop

2012-11-12 Thread Johannes Schulte
Hi Florents, it just became different but still works without HDFS. I also had trouble getting the right classes together, but here is something that will hopefully work correctly: DistanceMeasure measure = new CosineDistanceMeasure(); // ClusterUtils is not a Mahout class List<Cluster>

Re: Mix of Content Based and Collaborative Filtering

2012-11-05 Thread Johannes Schulte
, payloads (as of a while ago) were not accessed very efficiently. This can massively slow down scoring. On Mon, Nov 5, 2012 at 7:01 AM, shubham srivastava shubha...@gmail.com wrote: http://sujitpal.blogspot.in/2011/01/payloads-with-solr.html On Fri, Nov 2, 2012 at 12:13 PM, Johannes

Re: Mix of Content Based and Collaborative Filtering

2012-11-05 Thread Johannes Schulte
: On Mon, Nov 5, 2012 at 12:06 PM, Johannes Schulte johannes.schu...@gmail.com wrote: do you really mean payloads? Because i consider them part of the index as they are stored per position and can be accessed during scoring. I had the impression that they were not indexed

Re: K-Means as a surrogate for Matrix Factorization

2012-10-05 Thread Johannes Schulte
with your situation.) Sean On Fri, Oct 5, 2012 at 10:44 AM, Johannes Schulte johannes.schu...@gmail.com wrote: Hi! I got a question concerning a recommendation / classification problem which i originally wanted to solve with matrix factorization methods from taste / mahout

Re: K-Means as a surrogate for Matrix Factorization

2012-10-05 Thread Johannes Schulte
recommendations to avoid solving the reverse problem. On Fri, Oct 5, 2012 at 12:42 PM, Johannes Schulte johannes.schu...@gmail.com wrote: Sean, thanks for your input. It's more like 30 million users + id mapping for both items and users, but i could probably sample that to something that fits