Stat support in Mahout for dev

2014-02-09 Thread Dmitriy Lyubimov
Hi,

I vaguely remember mention of external statistics dependencies in Mahout.
What dependencies would we rely on if we wanted to sample from, say, an
inverse-gamma or Wishart distribution?

Thanks in advance.
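For what it's worth, even without a library dependency, inverse-gamma sampling reduces to gamma sampling: if X ~ Gamma(shape a, rate b), then 1/X ~ InvGamma(a, b) (and a Wishart draw can be built from gamma/normal draws via the Bartlett decomposition). A minimal stdlib-only sketch using the Marsaglia-Tsang gamma sampler — class and method names here are illustrative, not Mahout API:

```java
import java.util.Random;

public class InverseGammaSampler {
    private final Random rng;

    public InverseGammaSampler(long seed) {
        this.rng = new Random(seed);
    }

    /** Marsaglia-Tsang squeeze sampler for Gamma(shape, scale). */
    public double gamma(double shape, double scale) {
        if (shape < 1.0) {
            // Boost trick: draw Gamma(shape + 1) and multiply by U^(1/shape).
            double u = rng.nextDouble();
            return gamma(shape + 1.0, scale) * Math.pow(u, 1.0 / shape);
        }
        double d = shape - 1.0 / 3.0;
        double c = 1.0 / Math.sqrt(9.0 * d);
        while (true) {
            double x = rng.nextGaussian();
            double v = 1.0 + c * x;
            if (v <= 0) continue;               // reject out-of-support draws
            v = v * v * v;
            double u = rng.nextDouble();
            // Fast squeeze check, then the exact log acceptance test.
            if (u < 1.0 - 0.0331 * x * x * x * x) return d * v * scale;
            if (Math.log(u) < 0.5 * x * x + d * (1.0 - v + Math.log(v))) {
                return d * v * scale;
            }
        }
    }

    /** If X ~ Gamma(shape, rate) then 1/X ~ InvGamma(shape, rate). */
    public double inverseGamma(double shape, double rate) {
        return 1.0 / gamma(shape, 1.0 / rate);  // gamma scale = 1/rate
    }
}
```

A handy sanity check: InvGamma(a, b) has mean b/(a-1) for a > 1, so the empirical mean of many draws from InvGamma(3, 2) should be close to 1.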


Re: Learning to rank support in Mahout and Solr integration?

2014-02-09 Thread peng

Hi Dr Dunning,

Thanks a lot! I was trying to make the model generalizable enough, but 
I'm also afraid I may 'abuse' it a bit. Here is my existing solution:


1. Wrap any scorer in a ValueSource (many exist out of the box in 
lucene-solr; extensions are possible, but they don't have to be 
registered with a ValueSourceParser since they won't be used independently).
2. Extend CustomScoreQuery to produce a flat, straightforward 
explanation form. Use this as a wrapper around filters (as the 
subquery) and scorers (as function queries).
3. Write a converter that prints the flat explanation as 
Mahout-compatible vectors.
4. Run a job to 'explain()' those ground truths against an index and 
dump the resulting vectors.

5. (Optional) Run other jobs to get non-content-based score vectors.
6. Join them, feed them into a classifier/regressor, and do some model 
selection.
7. (I haven't done anything from this point on.) Try to 'migrate' this 
model into another CustomScoreQuery, which has a strong scorer that 
ensembles the features the way the model suggests.

8. Push it to the SolrCloud server and register it with a QParser.
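Step 3 above can be sketched as a toy converter — the feature names are hypothetical, and the CSV row is a stand-in for the real pipeline, which would write Mahout VectorWritable SequenceFiles rather than text:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy converter: turn a flat explanation (feature -> score, one per
 *  line) into one dense vector row. The feature order is fixed up
 *  front so every emitted row lines up column by column. */
public class ExplanationToVector {
    private final String[] featureOrder;

    public ExplanationToVector(String... featureOrder) {
        this.featureOrder = featureOrder;
    }

    /** Parse lines of the form "name = score" into a feature map. */
    public Map<String, Double> parseFlatExplanation(String explanation) {
        Map<String, Double> features = new LinkedHashMap<>();
        for (String line : explanation.split("\n")) {
            String[] kv = line.split("=");
            if (kv.length == 2) {
                features.put(kv[0].trim(), Double.parseDouble(kv[1].trim()));
            }
        }
        return features;
    }

    /** Emit one CSV row; features absent from the explanation default to 0. */
    public String toCsvRow(Map<String, Double> features) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < featureOrder.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(features.getOrDefault(featureOrder[i], 0.0));
        }
        return sb.toString();
    }
}
```

Defaulting missing features to 0 matters here, since (as noted below) some queries' explain() implementations simply omit scores.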

What I found to be hard:

1. This use of explanations is kind of abusive; they are only designed 
for manual tweaking. I constantly run into cases where the 'explain()' 
implementation was looked down upon by developers and filled in with 
code stubs. Notably, ToParentBlockJoinQuery won't show nested scores, 
and ToChildBlockJoinQuery simply doesn't work.
2. There is no automatic way to 'migrate' the model into an ensemble 
query. Though I haven't gotten that far, I'm already afraid of the 
difficulty.
3. As a NoSQL database optimized to the core for text processing, Solr 
makes extensions quite unintuitive and hard to debug and maintain. We 
try to keep this part minimal but still get stuck at some point.


The environment is built on CDH 5.0 beta 2 with YARN and Cloudera Search 
(Solr 4.4); some bugs then forced me to uninstall it and install 
SolrCloud 4.6. I wonder if there are more out-of-the-box solutions?


Yours Peng

On Sun 09 Feb 2014 05:53:20 PM EST, Ted Dunning wrote:

I think that this is a bit of an idiosyncratic model for learning to rank,
but it is a reasonably viable one.

It would be good to have a discussion of what you find hard or easy and
what you think is needed to make this work.

Let's talk.



On Sun, Feb 9, 2014 at 2:26 PM, peng  wrote:


This is what I believe to be a typical learning to rank model:

1. Create many weak rankers/scorers (a.k.a. feature engineering; in Solr
these are queries/function queries).
2. Test those scorers on a ground-truth dataset, generating feature
vectors for the top-n results annotated by humans.
3. Use an existing classifier/regressor (e.g. support vector ranking,
GBDT, random forest, etc.) on those feature vectors to get a ranking model.
4. Export this ranking model back to Solr as a custom ensemble query (a
BooleanQuery with custom boost factors for a linear model, or a
CustomScoreQuery with a custom scoring function for a non-linear model),
push it to the Solr server, register it with a QParser, and push it to
production. End of.

But I didn't find this workflow easy to implement with Mahout-Solr
integration (is it discouraged for some reason?). Namely, there is no
pipeline from scorer results to a Mahout-compatible vector form, and
there is no pipeline from the ranking model back to an ensemble query.
(I only found the lucene2seq class and the upcoming recommendation
support, which don't quite fit this scenario.) So what's the best
practice for easily implementing a real-time learning-to-rank search
engine in this case? I've worked at a bunch of startups, and such an
application seems to be in high demand. (Remember the Solr-based
collaborative filtering model proposed by Dr. Dunning? This is its
content-based counterpart.)

I'm looking forward to streamlining this process to make my upcoming
work easier. I think Mahout/Solr is the undisputed instrument of choice
due to its scalability and the machine-learning background of many of
its top committers. Can we talk about it at some point?

Yours Peng





Re: Learning to rank support in Mahout and Solr integration?

2014-02-09 Thread Ted Dunning
I think that this is a bit of an idiosyncratic model for learning to rank,
but it is a reasonably viable one.

It would be good to have a discussion of what you find hard or easy and
what you think is needed to make this work.

Let's talk.



On Sun, Feb 9, 2014 at 2:26 PM, peng  wrote:

> This is what I believe to be a typical learning to rank model:
>
> 1. Create many weak rankers/scorers (a.k.a. feature engineering; in Solr
> these are queries/function queries).
> 2. Test those scorers on a ground-truth dataset, generating feature
> vectors for the top-n results annotated by humans.
> 3. Use an existing classifier/regressor (e.g. support vector ranking,
> GBDT, random forest, etc.) on those feature vectors to get a ranking model.
> 4. Export this ranking model back to Solr as a custom ensemble query (a
> BooleanQuery with custom boost factors for a linear model, or a
> CustomScoreQuery with a custom scoring function for a non-linear model),
> push it to the Solr server, register it with a QParser, and push it to
> production. End of.
>
> But I didn't find this workflow easy to implement with Mahout-Solr
> integration (is it discouraged for some reason?). Namely, there is no
> pipeline from scorer results to a Mahout-compatible vector form, and
> there is no pipeline from the ranking model back to an ensemble query.
> (I only found the lucene2seq class and the upcoming recommendation
> support, which don't quite fit this scenario.) So what's the best
> practice for easily implementing a real-time learning-to-rank search
> engine in this case? I've worked at a bunch of startups, and such an
> application seems to be in high demand. (Remember the Solr-based
> collaborative filtering model proposed by Dr. Dunning? This is its
> content-based counterpart.)
>
> I'm looking forward to streamlining this process to make my upcoming
> work easier. I think Mahout/Solr is the undisputed instrument of choice
> due to its scalability and the machine-learning background of many of
> its top committers. Can we talk about it at some point?
>
> Yours Peng
>


Learning to rank support in Mahout and Solr integration?

2014-02-09 Thread peng

This is what I believe to be a typical learning to rank model:

1. Create many weak rankers/scorers (a.k.a. feature engineering; in Solr 
these are queries/function queries).
2. Test those scorers on a ground-truth dataset, generating feature 
vectors for the top-n results annotated by humans.
3. Use an existing classifier/regressor (e.g. support vector ranking, 
GBDT, random forest, etc.) on those feature vectors to get a ranking model.
4. Export this ranking model back to Solr as a custom ensemble query (a 
BooleanQuery with custom boost factors for a linear model, or a 
CustomScoreQuery with a custom scoring function for a non-linear model), 
push it to the Solr server, register it with a QParser, and push it to 
production. End of.
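For the linear case in step 4, the 'export' can be as small as serializing the learned weights into a Solr function-query string. A sketch — the feature names are hypothetical, each $param is assumed to resolve to a registered feature query, and wiring the string into a boost or {!func} parameter is left to the QParser configuration:

```java
import java.util.Locale;
import java.util.Map;

/** Turn learned linear weights into a Solr function-query string, e.g.
 *  sum(product(0.7,query($bm25)),product(0.3,query($recency))),
 *  so the learned model scores documents as a weighted sum of the
 *  per-feature query scores. */
public class LinearModelExporter {
    public static String toFunctionQuery(Map<String, Double> weights) {
        StringBuilder sb = new StringBuilder("sum(");
        boolean first = true;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            // One product(weight, query($feature)) term per feature.
            sb.append(String.format(Locale.ROOT,
                    "product(%s,query($%s))", e.getValue(), e.getKey()));
        }
        return sb.append(')').toString();
    }
}
```

The non-linear case (a tree ensemble inside a CustomScoreQuery) has no such textual shortcut, which is exactly the migration gap described below.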


But I didn't find this workflow easy to implement with Mahout-Solr 
integration (is it discouraged for some reason?). Namely, there is no 
pipeline from scorer results to a Mahout-compatible vector form, and 
there is no pipeline from the ranking model back to an ensemble query. 
(I only found the lucene2seq class and the upcoming recommendation 
support, which don't quite fit this scenario.) So what's the best 
practice for easily implementing a real-time learning-to-rank search 
engine in this case? I've worked at a bunch of startups, and such an 
application seems to be in high demand. (Remember the Solr-based 
collaborative filtering model proposed by Dr. Dunning? This is its 
content-based counterpart.)


I'm looking forward to streamlining this process to make my upcoming 
work easier. I think Mahout/Solr is the undisputed instrument of choice 
due to its scalability and the machine-learning background of many of 
its top committers. Can we talk about it at some point?


Yours Peng