Re: extract multi-features for one solr feature extractor in solr learning to rank

2017-04-18 Thread Jianxiong Dong
Hi, Michael,
 Thanks for the very valuable feedback.

> You can pass in different params in the
> features.json config for each feature, even though they use the same
> feature class.
I used this idea to extract some of the features described in this paper
(https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/letor3.pdf).
For example, features 1-15 in Table 2 are just <query, doc> term features in
various forms:

  {
    "store" : "MyFeatureStore",
    "name" : "term_count_1",
    "class" : "com.apache.solr.ltr.feature.TermCountFeature",
    "params" : {
      "field" : "a_text",
      "terms" : "${user_terms}",
      "method" : "1"
    }
  },
  {
    "store" : "MyFeatureStore",
    "name" : "term_count_2",
    "class" : "com.apache.solr.ltr.feature.TermCountFeature",
    "params" : {
      "field" : "a_text",
      "terms" : "${user_terms}",
      "method" : "2"
    }
  },

where the method id selects one of features 1-15 in Table 2. Although those
features share the same class, the differences between them are minor. In a
production deployment this overhead may not be an issue; after feature
selection, probably only a small number of features remain useful.
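
To make this concrete, here is a rough standalone sketch (plain Java, not the
actual Solr Feature/FeatureScorer API; the class name, the method numbering and
the statistics are my own invention) of how a single class driven by a "method"
parameter could compute several Table 2 style term statistics:

import java.util.Arrays;
import java.util.List;

// Illustration only: one class, parameterized by "method", producing
// different <query, doc> term statistics (in the spirit of Table 2, features 1-15).
public class TermCountSketch {

    private final String method;   // selects which statistic to compute

    public TermCountSketch(String method) {
        this.method = method;
    }

    // docTerms: tokenized field content; queryTerms: tokenized user terms
    public float score(List<String> docTerms, List<String> queryTerms) {
        switch (method) {
            case "1":   // raw count of query-term occurrences in the document
                return (float) queryTerms.stream()
                        .mapToLong(q -> docTerms.stream().filter(q::equals).count())
                        .sum();
            case "2":   // the same count, normalized by document length
                return docTerms.isEmpty() ? 0f
                        : ((float) queryTerms.stream()
                                .mapToLong(q -> docTerms.stream().filter(q::equals).count())
                                .sum()) / docTerms.size();
            default:
                throw new IllegalArgumentException("unknown method: " + method);
        }
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("solr", "learning", "to", "rank", "solr");
        List<String> query = Arrays.asList("solr", "rank");
        System.out.println(new TermCountSketch("1").score(doc, query));   // 3.0
        System.out.println(new TermCountSketch("2").score(doc, query));   // 0.6
    }
}

In the real plugin each variant would still be registered as its own entry in
features.json, as above, but the Java code is shared.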

Another use case:
use a convolutional neural network or an LSTM to extract an embedding vector
for both the query and the document, where the dimension of the embedding
vectors would be around 50-100. Those features are then fed into the
learning-to-rank model.
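
In the current framework that would mean flattening the embedding into one
feature per dimension. A toy sketch (the "embed_i" naming is just for
illustration):

import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration: a 100-dimensional embedding forced into the
// one-value-per-feature model becomes 100 separately named features,
// each of which would need its own entry in features.json.
public class EmbeddingFlattenSketch {

    public static Map<String, Float> flatten(float[] embedding) {
        Map<String, Float> features = new LinkedHashMap<>();
        for (int i = 0; i < embedding.length; i++) {
            features.put("embed_" + i, embedding[i]);
        }
        return features;
    }

    public static void main(String[] args) {
        float[] embedding = new float[100];   // stand-in for a CNN/LSTM output
        System.out.println(flatten(embedding).size());   // 100
    }
}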

> Your performance point about 100 features vs 1 feature is true,
> and pull requests to improve the plugin's performance and usability would
I will run some performance benchmarks for a few use cases to determine
whether supporting multiple features per feature class is worthwhile.
If so, I will share the results and create a pull request.

Thanks

Jianxiong

On 4/18/17, Michael Nilsson <mnilsson2...@gmail.com> wrote:
> Hi Jianxiong,
>
> What you say is true.  If you want 100 different feature values extracted,
> you need to specify 100 different features in the
> features.json config so that there is a direct mapping of features in and
> features out.  However, you more than likely need
> to only implement 1 feature class that you will use for those 100 feature
> values.  You can pass in different params in the
> features.json config for each feature, even though they use the same
> feature class.  In some cases you might be able to
> just have 1 feature output 1 value that changes per document, if you can
> collapse those features together.  This 2nd option
> may or may not work for you depending on your data, what you are trying to
> bucket, and what algorithm you are trying to
> use because not all algorithms can easily handle this case.  To illustrate:
>
>
> *A) Multiple binary features using the same 1 class*
> {
> "name" : "isProductCheap",
> "class" : "org.apache.solr.ltr.feature.SolrFeature",
> "params" : {
>   "fq": [ "price:[0 TO 100]" ]
> }
> },{
> "name" : "isProductExpensive",
> "class" : "org.apache.solr.ltr.feature.SolrFeature",
> "params" : {
>   "fq": [ "price:[101 TO 1000]" ]
> }
> },{
> "name" : "isProductCrazyExpensive",
> "class" : "org.apache.solr.ltr.feature.SolrFeature",
> "params" : {
>   "fq": [ "price:[1001 TO *]" ]
> }
> }
>
>
> *B) 1 feature that outputs different values (some algorithms don't handle
> discrete features well)*
> {
> "name" : "productPricePoint",
> "class" : "org.apache.solr.ltr.feature.MyPricePointFeature",
> "params" : {
>
>   // Either hard-code the price map in MyPricePointFeature.java, or
>   // pass it in through params for flexible customization, and
>   // return different values for cheap, expensive, and crazyExpensive
>
> }
> }
>
> The 2 options above satisfy most use cases, which is what we were
> targeting.
> In my specific use case, I opted for option A,
> and wrote a simple script that generates the features.json so I wouldn't
> have to write 100 similar features by hand.  You
> also mentioned that you want to extract features sparsely.  You can change
> the configuration of the Feature Transformer
> <http://lucene.apache.org/solr/6_5_0/solr-ltr/org/apache/solr/ltr/response/transform/LTRFeatureLoggerTransformerFactory.html>
>
> to return features that actuall
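
For reference, a throwaway generator along the lines Michael describes for
option A might look like this (plain Java; the names and price buckets simply
mirror his example):

import java.util.Arrays;
import java.util.List;

// Emits the repetitive features.json entries for option A so they do not
// have to be written by hand. Bucket boundaries mirror the example above.
public class FeaturesJsonGenerator {

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "isProductCheap", "isProductExpensive", "isProductCrazyExpensive");
        List<String> ranges = Arrays.asList(
                "price:[0 TO 100]", "price:[101 TO 1000]", "price:[1001 TO *]");

        StringBuilder json = new StringBuilder("[\n");
        for (int i = 0; i < names.size(); i++) {
            json.append("  {\n")
                .append("    \"name\" : \"").append(names.get(i)).append("\",\n")
                .append("    \"class\" : \"org.apache.solr.ltr.feature.SolrFeature\",\n")
                .append("    \"params\" : { \"fq\" : [ \"").append(ranges.get(i)).append("\" ] }\n")
                .append("  }").append(i < names.size() - 1 ? "," : "").append("\n");
        }
        json.append("]");
        System.out.println(json);   // redirect to features.json and upload to the feature store
    }
}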

extract multi-features for one solr feature extractor in solr learning to rank

2017-04-14 Thread Jianxiong Dong
Hi,
I found that solr learning-to-rank (LTR) supports only ONE feature
for a given feature extractor.

See interface:

https://github.com/apache/lucene-solr/blob/master/solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/Feature.java

Lines 281-282 (in FeatureScorer):
@Override
  public abstract float score() throws IOException;

I have a use case: given a <query, doc> pair, I would like to extract multiple
features (e.g. 100 features). In the current framework, I have to define 100
features in features.json, and there is also more cost for the scored-document
iterations.

I would like to have an interface:

public abstract Map score() throws IOException;

It would also help support sparse feature vectors.
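
To be concrete, something along these lines is what I have in mind. This is
purely hypothetical, not existing Solr code; the generics and the dense/sparse
handling are open questions:

import java.io.IOException;
import java.util.Map;

// Hypothetical sketch of a multi-valued scorer contract (NOT existing Solr code).
// One call returns a sparse map from feature name to value for the current
// document, so a single extractor can emit many features in one pass.
public interface MultiValueFeatureScorer {

    // Features absent from the returned map would take their default values.
    Map<String, Float> scores() throws IOException;
}

The feature logger could then expand the returned map against the declared
feature names, filling in defaults for missing keys.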

Can anybody provide some insight?

Thanks

Jianxiong


solr learning_to_rank (normalizer) unmatched argument type issue

2017-03-31 Thread Jianxiong Dong
Hi,
I created a toy learning-to-rank model in Solr to illustrate the issue.

Feature.json
-
[
  {
"store" : "wikiFeatureStore",
"name" : "doc_len",
"class" : "org.apache.solr.ltr.feature.FieldLengthFeature",
"params" : {"field":"a_text"}
  },
  {
"store" : "wikiFeatureStore",
"name" : "rankScore",
"class" : "org.apache.solr.ltr.feature.OriginalScoreFeature",
"params" : {}
  }
]

model.json
---
{
  "store" : "wikiFeatureStore",
  "class" : "org.apache.solr.ltr.model.LinearModel",
  "name" : "wiki_qaModel",
  "features" : [
    {
      "name" : "doc_len",
      "norm" : {
        "class" : "org.apache.solr.ltr.norm.MinMaxNormalizer",
        "params" : { "min" : "1.0", "max" : "113.8" }
      }
    },
    {
      "name" : "rankScore",
      "norm" : {
        "class" : "org.apache.solr.ltr.norm.MinMaxNormalizer",
        "params" : { "min" : "0.0", "max" : "49.60385" }
      }
    }
  ],
  "params" : {
    "weights" : {
      "doc_len" : 0.322,
      "rankScore" : 0.98
    }
  }
}

I could upload both the features and the model and perform re-ranking based
on the above model. The issue appeared when I stopped the Solr server and
restarted it: I got an error message when I ran the same query to extract the
features:
"Caused by: org.apache.solr.common.SolrException: Failed to create new
ManagedResource /schema/model-store of type
org.apache.solr.ltr.store.rest.ManagedModelStore due to:
java.lang.IllegalArgumentException: argument type mismatch
at 
org.apache.solr.rest.RestManager.createManagedResource(RestManager.java:700)
at 
org.apache.solr.rest.RestManager.addRegisteredResource(RestManager.java:666)
at org.apache.solr.rest.RestManager.access$300(RestManager.java:59)
at 
org.apache.solr.rest.RestManager$Registry.registerManagedResource(RestManager.java:231)
at 
org.apache.solr.ltr.store.rest.ManagedModelStore.registerManagedModelStore(ManagedModelStore.java:51)
at 
org.apache.solr.ltr.search.LTRQParserPlugin.inform(LTRQParserPlugin.java:124)
at 
org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:719)
at org.apache.solr.core.SolrCore.init(SolrCore.java:931)
... 9 more
Caused by: java.lang.IllegalArgumentException: argument type mismatch
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.solr.util.SolrPluginUtils.invokeSetters(SolrPluginUtils.java:1077)
at org.apache.solr.ltr.norm.Normalizer.getInstance(Normalizer.java:49)
"

I found that the issue was related to
solr-6.4.2/server/solr/my_collection/conf/_schema_model-store.json
"
{
  "initArgs":{},
  "initializedOn":"2017-03-31T20:51:59.494Z",
  "updatedSinceInit":"2017-03-31T20:54:54.841Z",
  "managedList":[{
  "name":"wiki_qaModel",
  "class":"org.apache.solr.ltr.model.LinearModel",
  "store":"wikiFeatureStore",
  "features":[
{
  "name":"doc_len",
  "norm":{
"class":"org.apache.solr.ltr.norm.MinMaxNormalizer",
"params":{
  "min":1.0,
  "max":113.7862548828}}},
...
"

Here the data types for "min" and "max" are doubles. When I manually changed
them to strings, everything worked as expected:

"
 "norm":{
"class":"org.apache.solr.ltr.norm.MinMaxNormalizer",
"params":{
  "min": "1.0",
  "max": "113.7862548828"}}},


Any insights into the above strange behavior?

Thanks

Jianxiong