Stat support in Mahout for dev
Hi, I vaguely remember mention of external stat dependencies in Mahout. What dependencies would we rely on if we wanted to sample from, say, an inverse gamma or Wishart distribution? Thanks in advance.
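For the inverse gamma case, one dependency-light route is to draw from a gamma distribution and take the reciprocal: if X ~ Gamma(shape a, rate b), then 1/X ~ InvGamma(shape a, scale b). The following is a minimal sketch of that identity using only Python's standard library, not a statement about what Mahout ships; the function name and parameters are illustrative assumptions, and the Wishart case is not covered.

```python
import random


def sample_inverse_gamma(shape: float, scale: float, rng: random.Random) -> float:
    """Draw one sample from InvGamma(shape, scale) via the reciprocal of a gamma draw.

    random.gammavariate(alpha, beta) treats beta as the gamma *scale*,
    so a gamma with rate `scale` is requested by passing 1.0 / scale.
    """
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)


rng = random.Random(42)  # fixed seed so the run is reproducible
draws = [sample_inverse_gamma(5.0, 4.0, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)  # theoretical mean is scale / (shape - 1) = 1.0
```

The sample mean can be compared against scale / (shape - 1), which is defined only for shape > 1.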
Re: Learning to rank support in Mahout and Solr integration?
Hi Dr Dunning,

Thanks a lot! I was trying to make the model generalizable enough, but I'm also afraid I may 'abuse' it a bit. Here is my existing solution:

1. Wrap each scorer in a ValueSource (many exist out of the box in lucene-solr; extensions are possible, but they don't have to be registered with a ValueSourceParser because they won't be used independently).
2. Extend CustomScoreQuery to have a flat and straightforward explanation form. Use this as a wrapper of filters (as SubQ) and scorers (as FunctionQ).
3. Write a converter to print the flat explanation as Mahout-compatible vectors.
4. Run a job to 'explain()' the ground truths on an index and dump the resulting vectors.
5. (Optional) Run other jobs to get non-content-based score vectors.
6. Join them, feed them into a classifier/regressor, and do some model selection.
7. (From this point on I haven't done anything.) Try to 'migrate' this model into another CustomScoreQuery, which has a strong scorer that ensembles the features in the way the model suggests.
8. Push it to the Solr Cloud server and register it with a QParser.

What I found to be hard:

1. Using explanations this way is something of an abuse; they are only designed for manual tweaking. I constantly run into cases where the 'explain()' implementation was neglected by developers and filled in with code stubs. Notably, ToParentBlockJoin won't show nested scores, and ToChildBlockJoin simply doesn't work.
2. There is no automatic way to 'migrate' a model to an ensemble query. Though I haven't proceeded that far, I'm already afraid of the difficulty.
3. Solr is effectively a NoSQL database optimized to the core for text processing, so its extensions are not at all intuitive and are hard to debug and maintain. We try to keep this part minimal but still got stuck at some point.

The environment is built on CDH 5.0beta2 with YARN and Cloudera Search (Solr 4.4); some bugs then forced me to uninstall it and install Solr Cloud 4.6. I wonder if there are more 'out-of-the-box' solutions?
Yours Peng
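Step 3 of the solution above (turning a flat explanation into a vector) could look roughly like the sketch below. It is only an illustration under stated assumptions: the flat explanation is modeled as a hypothetical list of (feature name, score) pairs, the feature names are made up, and the "col:value" text output is a stand-in for whatever Mahout vector serialization the real converter writes.

```python
from typing import Dict, List, Tuple

# Hypothetical flat explanation: one (feature_name, score) pair per wrapped scorer.
FlatExplanation = List[Tuple[str, float]]


def build_feature_index(explanations: List[FlatExplanation]) -> Dict[str, int]:
    """Assign a stable column index to every feature name, in sorted order."""
    names = sorted({name for expl in explanations for name, _ in expl})
    return {name: i for i, name in enumerate(names)}


def to_sparse_vector(expl: FlatExplanation, index: Dict[str, int]) -> Dict[int, float]:
    """Map one flat explanation to {column: value}, omitting zero scores."""
    vec: Dict[int, float] = {}
    for name, score in expl:
        if score != 0.0:
            col = index[name]
            vec[col] = vec.get(col, 0.0) + score
    return vec


def format_vector(vec: Dict[int, float]) -> str:
    """Render a sparse vector as space-separated 'col:value' pairs."""
    return " ".join(f"{col}:{val}" for col, val in sorted(vec.items()))


# Illustrative data only: two documents scored by three made-up features.
explanations = [
    [("title_match", 2.5), ("body_match", 0.5)],
    [("body_match", 1.25), ("recency", 0.75)],
]
index = build_feature_index(explanations)
vectors = [to_sparse_vector(e, index) for e in explanations]
lines = [format_vector(v) for v in vectors]
```

Building the index over all explanations first keeps the column assignment consistent across documents, which matters when the vectors are later joined with the non-content-based scores from step 5.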
Re: Learning to rank support in Mahout and Solr integration?
I think that this is a bit of an idiosyncratic model for learning to rank, but it is a reasonably viable one. It would be good to have a discussion of what you find hard or easy and what you think is needed to make this work. Let's talk.
Learning to rank support in Mahout and Solr integration?
This is what I believe to be a typical learning to rank model:

1. Create many weak rankers/scorers (a.k.a. feature engineering; in Solr these are queries/function queries).
2. Test those scorers on a ground truth dataset, generating feature vectors for the top-n results annotated by humans.
3. Use an existing classifier/regressor (e.g. support vector ranking, GBDT, random forest, etc.) on those feature vectors to get a ranking model.
4. Export this ranking model back to Solr as a custom ensemble query (a BooleanQuery with custom boosting factors for a linear model, or a CustomScoreQuery with a custom scoring function for a non-linear model), push it to the Solr server, and register it with a QParser. Push it to production. End of.

But I didn't find this workflow easy to implement with the Mahout-Solr integration (is it discouraged for some reason?). Namely, there is no pipeline from the results of scorers to a Mahout-compatible vector form, and there is no pipeline from a ranking model back to an ensemble query. (I only found the lucene2seq class and the upcoming recommendation support, which don't quite fit this scenario.) So what's the best practice for easily implementing a real-time learning to rank search engine in this case? I've worked in a bunch of startups, and such an appliance seems to be in high demand. (Remember the Solr-based collaborative filtering model proposed by Dr Dunning? This is its content-based counterpart.)

I'm looking forward to streamlining this process to make my upcoming work easier. I think Mahout/Solr is the undisputed instrument of choice due to their scalability and the machine learning background of many of their top committers. Can we talk about it at some point?

Yours Peng
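For the linear case in step 4, the export amounts to turning each learned weight into a boost on the matching clause. Below is a minimal sketch that emits a boosted query string in Lucene's classic "(clause)^boost" syntax. The clause strings, feature names, and weights are made up for illustration, and skipping non-positive weights is a simplifying assumption of this sketch, not part of the model described above.

```python
from typing import Dict


def to_boosted_query(clauses: Dict[str, str], weights: Dict[str, float]) -> str:
    """Combine per-feature query clauses into one boosted query string.

    Each clause is wrapped as '(clause)^weight'. Features with a missing
    or non-positive weight are skipped (an assumption of this sketch).
    """
    parts = []
    for name in sorted(clauses):
        weight = weights.get(name, 0.0)
        if weight <= 0.0:
            continue
        parts.append(f"({clauses[name]})^{weight:g}")
    return " ".join(parts)


# Illustrative inputs only: one query clause per weak scorer, plus learned weights.
clauses = {
    "title_match": "title:mahout",
    "body_match": "body:mahout",
    "recency": "year:2014",
}
weights = {"title_match": 1.5, "body_match": 0.25, "recency": -0.1}
query = to_boosted_query(clauses, weights)
```

Because a boolean query combines its clause scores additively (up to Lucene's own normalization factors), per-clause boosts only approximate the weighted sum of a linear model; a non-linear model would need the CustomScoreQuery route instead.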