Re: [More Like This] Query building

2016-04-12 Thread Scott Stults
Hi Alessandro,

It's not uncommon for Solr patches to remain uncommitted for months, even
years. In fact some never get merged. Don't let that discourage you!


k/r,
Scott

On Fri, Mar 11, 2016 at 11:49 AM, Alessandro Benedetti <
abenede...@apache.org> wrote:

> I start to feel that is not that easy to contribute improvements or small
> fix to Solr ( if they are not super interesting to the mass) .
> I think this one could be a good improvement in the MLT but I would love to
> discuss this with some committer.
> The patch is attached, it is there since months ago...
> Any feedback would be appreciated, I want to contribute, but I need some
> second opinions ...
>
> Cheers
>
> On 11 February 2016 at 13:48, Alessandro Benedetti 
> wrote:
>
> > Hi Guys,
> > is it possible to have any feedback ?
> > Is there any process to speed up bug resolution / discussions ?
> > just want to understand if the patch is not good enough, if I need to
> > improve it or simply no-one took a look ...
> >
> > https://issues.apache.org/jira/browse/LUCENE-6954
> >
> > Cheers
> >
> > On 11 January 2016 at 15:25, Alessandro Benedetti  >
> > wrote:
> >
> >> Hi guys,
> >> the patch seems fine to me.
> >> I didn't spend much more time on the code but I checked the tests and
> the
> >> pre-commit checks.
> >> It seems fine to me.
> >> Let me know ,
> >>
> >> Cheers
> >>
> >> On 31 December 2015 at 18:40, Alessandro Benedetti <
> abenede...@apache.org
> >> > wrote:
> >>
> >>> https://issues.apache.org/jira/browse/LUCENE-6954
> >>>
> >>> First draft patch available, I will check better the tests new year !
> >>>
> >>> On 29 December 2015 at 13:43, Alessandro Benedetti <
> >>> abenede...@apache.org> wrote:
> >>>
>  Sure, I will proceed tomorrow with the Jira and the simple patch +
>  tests.
> 
>  In the meantime let's try to collect some additional feedback.
> 
>  Cheers
> 
>  On 29 December 2015 at 12:43, Anshum Gupta 
>  wrote:
> 
> > Feel free to create a JIRA and put up a patch if you can.
> >
> > On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti <
> > abenede...@apache.org
> > > wrote:
> >
> > > Hi guys,
> > > While I was exploring the way we build the More Like This query, I
> > > discovered a part I am not convinced of :
> > >
> > >
> > >
> > > Let's see how we build the query :
> > > org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
> > >
> > > 1) we extract the terms from the interesting fields, adding them to
> > a map :
> > >
> > > Map termFreqMap = new HashMap<>();
> > >
> > > *( we lose the relation field-> term, we don't know anymore where
> > the term
> > > was coming ! )*
> > >
> > > org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
> > >
> > > 2) we build the queue that will contain the query terms, at this
> > point we
> > > connect again there terms to some field, but :
> > >
> > > ...
> > >> // go through all the fields and find the largest document
> frequency
> > >> String topField = fieldNames[0];
> > >> int docFreq = 0;
> > >> for (String fieldName : fieldNames) {
> > >>   int freq = ir.docFreq(new Term(fieldName, word));
> > >>   topField = (freq > docFreq) ? fieldName : topField;
> > >>   docFreq = (freq > docFreq) ? freq : docFreq;
> > >> }
> > >> ...
> > >
> > >
> > > We identify the topField as the field with the highest document
> > frequency
> > > for the term t .
> > > Then we build the termQuery :
> > >
> > > queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq,
> tf));
> > >
> > > In this way we lose a lot of precision.
> > > Not sure why we do that.
> > > I would prefer to keep the relation between terms and fields.
> > > The MLT query can improve a lot the quality.
> > > If i run the MLT on 2 fields : *description* and *facilities* for
> > example.
> > > It is likely I want to find documents with similar terms in the
> > > description and similar terms in the facilities, without mixing up
> > the
> > > things and loosing the semantic of the terms.
> > >
> > > Let me know your opinion,
> > >
> > > Cheers
> > >
> > >
> > > --
> > > --
> > >
> > > Benedetti Alessandro
> > > Visiting card : http://about.me/alessandro_benedetti
> > >
> > > "Tyger, tyger burning bright
> > > In the forests of the night,
> > > What immortal hand or eye
> > > Could frame thy fearful symmetry?"
> > >
> > > William Blake - Songs of Experience -1794 England
> > >
> >
> >
> >
> > --
> > Anshum Gupta
> >
> 
> 
> 
>  --
>  --
> 
>  Benedetti Alessandro
>  Visiting card : 

Re: [More Like This] Query building

2016-03-11 Thread Alessandro Benedetti
I start to feel that is not that easy to contribute improvements or small
fix to Solr ( if they are not super interesting to the mass) .
I think this one could be a good improvement in the MLT but I would love to
discuss this with some committer.
The patch is attached, it is there since months ago...
Any feedback would be appreciated, I want to contribute, but I need some
second opinions ...

Cheers

On 11 February 2016 at 13:48, Alessandro Benedetti 
wrote:

> Hi Guys,
> is it possible to have any feedback ?
> Is there any process to speed up bug resolution / discussions ?
> just want to understand if the patch is not good enough, if I need to
> improve it or simply no-one took a look ...
>
> https://issues.apache.org/jira/browse/LUCENE-6954
>
> Cheers
>
> On 11 January 2016 at 15:25, Alessandro Benedetti 
> wrote:
>
>> Hi guys,
>> the patch seems fine to me.
>> I didn't spend much more time on the code but I checked the tests and the
>> pre-commit checks.
>> It seems fine to me.
>> Let me know ,
>>
>> Cheers
>>
>> On 31 December 2015 at 18:40, Alessandro Benedetti > > wrote:
>>
>>> https://issues.apache.org/jira/browse/LUCENE-6954
>>>
>>> First draft patch available, I will check better the tests new year !
>>>
>>> On 29 December 2015 at 13:43, Alessandro Benedetti <
>>> abenede...@apache.org> wrote:
>>>
 Sure, I will proceed tomorrow with the Jira and the simple patch +
 tests.

 In the meantime let's try to collect some additional feedback.

 Cheers

 On 29 December 2015 at 12:43, Anshum Gupta 
 wrote:

> Feel free to create a JIRA and put up a patch if you can.
>
> On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti <
> abenede...@apache.org
> > wrote:
>
> > Hi guys,
> > While I was exploring the way we build the More Like This query, I
> > discovered a part I am not convinced of :
> >
> >
> >
> > Let's see how we build the query :
> > org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
> >
> > 1) we extract the terms from the interesting fields, adding them to
> a map :
> >
> > Map termFreqMap = new HashMap<>();
> >
> > *( we lose the relation field-> term, we don't know anymore where
> the term
> > was coming ! )*
> >
> > org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
> >
> > 2) we build the queue that will contain the query terms, at this
> point we
> > connect again there terms to some field, but :
> >
> > ...
> >> // go through all the fields and find the largest document frequency
> >> String topField = fieldNames[0];
> >> int docFreq = 0;
> >> for (String fieldName : fieldNames) {
> >>   int freq = ir.docFreq(new Term(fieldName, word));
> >>   topField = (freq > docFreq) ? fieldName : topField;
> >>   docFreq = (freq > docFreq) ? freq : docFreq;
> >> }
> >> ...
> >
> >
> > We identify the topField as the field with the highest document
> frequency
> > for the term t .
> > Then we build the termQuery :
> >
> > queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));
> >
> > In this way we lose a lot of precision.
> > Not sure why we do that.
> > I would prefer to keep the relation between terms and fields.
> > The MLT query can improve a lot the quality.
> > If i run the MLT on 2 fields : *description* and *facilities* for
> example.
> > It is likely I want to find documents with similar terms in the
> > description and similar terms in the facilities, without mixing up
> the
> > things and loosing the semantic of the terms.
> >
> > Let me know your opinion,
> >
> > Cheers
> >
> >
> > --
> > --
> >
> > Benedetti Alessandro
> > Visiting card : http://about.me/alessandro_benedetti
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
> >
>
>
>
> --
> Anshum Gupta
>



 --
 --

 Benedetti Alessandro
 Visiting card : http://about.me/alessandro_benedetti

 "Tyger, tyger burning bright
 In the forests of the night,
 What immortal hand or eye
 Could frame thy fearful symmetry?"

 William Blake - Songs of Experience -1794 England

>>>
>>>
>>>
>>> --
>>> --
>>>
>>> Benedetti Alessandro
>>> Visiting card : http://about.me/alessandro_benedetti
>>>
>>> "Tyger, tyger burning bright
>>> In the forests of the night,
>>> What immortal hand or eye
>>> Could frame thy fearful symmetry?"
>>>
>>> William Blake - Songs of Experience 

Re: [More Like This] Query building

2016-02-11 Thread Alessandro Benedetti
Hi Guys,
is it possible to have any feedback ?
Is there any process to speed up bug resolution / discussions ?
just want to understand if the patch is not good enough, if I need to
improve it or simply no-one took a look ...

https://issues.apache.org/jira/browse/LUCENE-6954

Cheers

On 11 January 2016 at 15:25, Alessandro Benedetti 
wrote:

> Hi guys,
> the patch seems fine to me.
> I didn't spend much more time on the code but I checked the tests and the
> pre-commit checks.
> It seems fine to me.
> Let me know ,
>
> Cheers
>
> On 31 December 2015 at 18:40, Alessandro Benedetti 
> wrote:
>
>> https://issues.apache.org/jira/browse/LUCENE-6954
>>
>> First draft patch available, I will check better the tests new year !
>>
>> On 29 December 2015 at 13:43, Alessandro Benedetti > > wrote:
>>
>>> Sure, I will proceed tomorrow with the Jira and the simple patch + tests.
>>>
>>> In the meantime let's try to collect some additional feedback.
>>>
>>> Cheers
>>>
>>> On 29 December 2015 at 12:43, Anshum Gupta 
>>> wrote:
>>>
 Feel free to create a JIRA and put up a patch if you can.

 On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti <
 abenede...@apache.org
 > wrote:

 > Hi guys,
 > While I was exploring the way we build the More Like This query, I
 > discovered a part I am not convinced of :
 >
 >
 >
 > Let's see how we build the query :
 > org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
 >
 > 1) we extract the terms from the interesting fields, adding them to a
 map :
 >
 > Map termFreqMap = new HashMap<>();
 >
 > *( we lose the relation field-> term, we don't know anymore where the
 term
 > was coming ! )*
 >
 > org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
 >
 > 2) we build the queue that will contain the query terms, at this
 point we
 > connect again there terms to some field, but :
 >
 > ...
 >> // go through all the fields and find the largest document frequency
 >> String topField = fieldNames[0];
 >> int docFreq = 0;
 >> for (String fieldName : fieldNames) {
 >>   int freq = ir.docFreq(new Term(fieldName, word));
 >>   topField = (freq > docFreq) ? fieldName : topField;
 >>   docFreq = (freq > docFreq) ? freq : docFreq;
 >> }
 >> ...
 >
 >
 > We identify the topField as the field with the highest document
 frequency
 > for the term t .
 > Then we build the termQuery :
 >
 > queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));
 >
 > In this way we lose a lot of precision.
 > Not sure why we do that.
 > I would prefer to keep the relation between terms and fields.
 > The MLT query can improve a lot the quality.
 > If i run the MLT on 2 fields : *description* and *facilities* for
 example.
 > It is likely I want to find documents with similar terms in the
 > description and similar terms in the facilities, without mixing up the
 > things and loosing the semantic of the terms.
 >
 > Let me know your opinion,
 >
 > Cheers
 >
 >
 > --
 > --
 >
 > Benedetti Alessandro
 > Visiting card : http://about.me/alessandro_benedetti
 >
 > "Tyger, tyger burning bright
 > In the forests of the night,
 > What immortal hand or eye
 > Could frame thy fearful symmetry?"
 >
 > William Blake - Songs of Experience -1794 England
 >



 --
 Anshum Gupta

>>>
>>>
>>>
>>> --
>>> --
>>>
>>> Benedetti Alessandro
>>> Visiting card : http://about.me/alessandro_benedetti
>>>
>>> "Tyger, tyger burning bright
>>> In the forests of the night,
>>> What immortal hand or eye
>>> Could frame thy fearful symmetry?"
>>>
>>> William Blake - Songs of Experience -1794 England
>>>
>>
>>
>>
>> --
>> --
>>
>> Benedetti Alessandro
>> Visiting card : http://about.me/alessandro_benedetti
>>
>> "Tyger, tyger burning bright
>> In the forests of the night,
>> What immortal hand or eye
>> Could frame thy fearful symmetry?"
>>
>> William Blake - Songs of Experience -1794 England
>>
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: [More Like This] Query building

2016-01-11 Thread Alessandro Benedetti
Hi guys,
the patch seems fine to me.
I didn't spend much more time on the code but I checked the tests and the
pre-commit checks.
It seems fine to me.
Let me know ,

Cheers

On 31 December 2015 at 18:40, Alessandro Benedetti 
wrote:

> https://issues.apache.org/jira/browse/LUCENE-6954
>
> First draft patch available, I will check better the tests new year !
>
> On 29 December 2015 at 13:43, Alessandro Benedetti 
> wrote:
>
>> Sure, I will proceed tomorrow with the Jira and the simple patch + tests.
>>
>> In the meantime let's try to collect some additional feedback.
>>
>> Cheers
>>
>> On 29 December 2015 at 12:43, Anshum Gupta 
>> wrote:
>>
>>> Feel free to create a JIRA and put up a patch if you can.
>>>
>>> On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti <
>>> abenede...@apache.org
>>> > wrote:
>>>
>>> > Hi guys,
>>> > While I was exploring the way we build the More Like This query, I
>>> > discovered a part I am not convinced of :
>>> >
>>> >
>>> >
>>> > Let's see how we build the query :
>>> > org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
>>> >
>>> > 1) we extract the terms from the interesting fields, adding them to a
>>> map :
>>> >
>>> > Map termFreqMap = new HashMap<>();
>>> >
>>> > *( we lose the relation field-> term, we don't know anymore where the
>>> term
>>> > was coming ! )*
>>> >
>>> > org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
>>> >
>>> > 2) we build the queue that will contain the query terms, at this point
>>> we
>>> > connect again there terms to some field, but :
>>> >
>>> > ...
>>> >> // go through all the fields and find the largest document frequency
>>> >> String topField = fieldNames[0];
>>> >> int docFreq = 0;
>>> >> for (String fieldName : fieldNames) {
>>> >>   int freq = ir.docFreq(new Term(fieldName, word));
>>> >>   topField = (freq > docFreq) ? fieldName : topField;
>>> >>   docFreq = (freq > docFreq) ? freq : docFreq;
>>> >> }
>>> >> ...
>>> >
>>> >
>>> > We identify the topField as the field with the highest document
>>> frequency
>>> > for the term t .
>>> > Then we build the termQuery :
>>> >
>>> > queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));
>>> >
>>> > In this way we lose a lot of precision.
>>> > Not sure why we do that.
>>> > I would prefer to keep the relation between terms and fields.
>>> > The MLT query can improve a lot the quality.
>>> > If i run the MLT on 2 fields : *description* and *facilities* for
>>> example.
>>> > It is likely I want to find documents with similar terms in the
>>> > description and similar terms in the facilities, without mixing up the
>>> > things and loosing the semantic of the terms.
>>> >
>>> > Let me know your opinion,
>>> >
>>> > Cheers
>>> >
>>> >
>>> > --
>>> > --
>>> >
>>> > Benedetti Alessandro
>>> > Visiting card : http://about.me/alessandro_benedetti
>>> >
>>> > "Tyger, tyger burning bright
>>> > In the forests of the night,
>>> > What immortal hand or eye
>>> > Could frame thy fearful symmetry?"
>>> >
>>> > William Blake - Songs of Experience -1794 England
>>> >
>>>
>>>
>>>
>>> --
>>> Anshum Gupta
>>>
>>
>>
>>
>> --
>> --
>>
>> Benedetti Alessandro
>> Visiting card : http://about.me/alessandro_benedetti
>>
>> "Tyger, tyger burning bright
>> In the forests of the night,
>> What immortal hand or eye
>> Could frame thy fearful symmetry?"
>>
>> William Blake - Songs of Experience -1794 England
>>
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: [More Like This] Query building

2015-12-31 Thread Alessandro Benedetti
https://issues.apache.org/jira/browse/LUCENE-6954

First draft patch available, I will check better the tests new year !

On 29 December 2015 at 13:43, Alessandro Benedetti 
wrote:

> Sure, I will proceed tomorrow with the Jira and the simple patch + tests.
>
> In the meantime let's try to collect some additional feedback.
>
> Cheers
>
> On 29 December 2015 at 12:43, Anshum Gupta  wrote:
>
>> Feel free to create a JIRA and put up a patch if you can.
>>
>> On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti <
>> abenede...@apache.org
>> > wrote:
>>
>> > Hi guys,
>> > While I was exploring the way we build the More Like This query, I
>> > discovered a part I am not convinced of :
>> >
>> >
>> >
>> > Let's see how we build the query :
>> > org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
>> >
>> > 1) we extract the terms from the interesting fields, adding them to a
>> map :
>> >
>> > Map termFreqMap = new HashMap<>();
>> >
>> > *( we lose the relation field-> term, we don't know anymore where the
>> term
>> > was coming ! )*
>> >
>> > org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
>> >
>> > 2) we build the queue that will contain the query terms, at this point
>> we
>> > connect again there terms to some field, but :
>> >
>> > ...
>> >> // go through all the fields and find the largest document frequency
>> >> String topField = fieldNames[0];
>> >> int docFreq = 0;
>> >> for (String fieldName : fieldNames) {
>> >>   int freq = ir.docFreq(new Term(fieldName, word));
>> >>   topField = (freq > docFreq) ? fieldName : topField;
>> >>   docFreq = (freq > docFreq) ? freq : docFreq;
>> >> }
>> >> ...
>> >
>> >
>> > We identify the topField as the field with the highest document
>> frequency
>> > for the term t .
>> > Then we build the termQuery :
>> >
>> > queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));
>> >
>> > In this way we lose a lot of precision.
>> > Not sure why we do that.
>> > I would prefer to keep the relation between terms and fields.
>> > The MLT query can improve a lot the quality.
>> > If i run the MLT on 2 fields : *description* and *facilities* for
>> example.
>> > It is likely I want to find documents with similar terms in the
>> > description and similar terms in the facilities, without mixing up the
>> > things and loosing the semantic of the terms.
>> >
>> > Let me know your opinion,
>> >
>> > Cheers
>> >
>> >
>> > --
>> > --
>> >
>> > Benedetti Alessandro
>> > Visiting card : http://about.me/alessandro_benedetti
>> >
>> > "Tyger, tyger burning bright
>> > In the forests of the night,
>> > What immortal hand or eye
>> > Could frame thy fearful symmetry?"
>> >
>> > William Blake - Songs of Experience -1794 England
>> >
>>
>>
>>
>> --
>> Anshum Gupta
>>
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


[More Like This] Query building

2015-12-29 Thread Alessandro Benedetti
Hi guys,
While I was exploring the way we build the More Like This query, I
discovered a part I am not convinced of :



Let's see how we build the query :
org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)

1) we extract the terms from the interesting fields, adding them to a map :

Map termFreqMap = new HashMap<>();

*( we lose the relation field-> term, we don't know anymore where the term
was coming ! )*

org.apache.lucene.queries.mlt.MoreLikeThis#createQueue

2) we build the queue that will contain the query terms, at this point we
connect again there terms to some field, but :

...
> // go through all the fields and find the largest document frequency
> String topField = fieldNames[0];
> int docFreq = 0;
> for (String fieldName : fieldNames) {
>   int freq = ir.docFreq(new Term(fieldName, word));
>   topField = (freq > docFreq) ? fieldName : topField;
>   docFreq = (freq > docFreq) ? freq : docFreq;
> }
> ...


We identify the topField as the field with the highest document frequency
for the term t .
Then we build the termQuery :

queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));

In this way we lose a lot of precision.
Not sure why we do that.
I would prefer to keep the relation between terms and fields.
The MLT query can improve a lot the quality.
If i run the MLT on 2 fields : *description* and *facilities* for example.
It is likely I want to find documents with similar terms in the description
and similar terms in the facilities, without mixing up the things and
loosing the semantic of the terms.

Let me know your opinion,

Cheers


-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: [More Like This] Query building

2015-12-29 Thread Anshum Gupta
Feel free to create a JIRA and put up a patch if you can.

On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti  wrote:

> Hi guys,
> While I was exploring the way we build the More Like This query, I
> discovered a part I am not convinced of :
>
>
>
> Let's see how we build the query :
> org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
>
> 1) we extract the terms from the interesting fields, adding them to a map :
>
> Map termFreqMap = new HashMap<>();
>
> *( we lose the relation field-> term, we don't know anymore where the term
> was coming ! )*
>
> org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
>
> 2) we build the queue that will contain the query terms, at this point we
> connect again there terms to some field, but :
>
> ...
>> // go through all the fields and find the largest document frequency
>> String topField = fieldNames[0];
>> int docFreq = 0;
>> for (String fieldName : fieldNames) {
>>   int freq = ir.docFreq(new Term(fieldName, word));
>>   topField = (freq > docFreq) ? fieldName : topField;
>>   docFreq = (freq > docFreq) ? freq : docFreq;
>> }
>> ...
>
>
> We identify the topField as the field with the highest document frequency
> for the term t .
> Then we build the termQuery :
>
> queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));
>
> In this way we lose a lot of precision.
> Not sure why we do that.
> I would prefer to keep the relation between terms and fields.
> The MLT query can improve a lot the quality.
> If i run the MLT on 2 fields : *description* and *facilities* for example.
> It is likely I want to find documents with similar terms in the
> description and similar terms in the facilities, without mixing up the
> things and loosing the semantic of the terms.
>
> Let me know your opinion,
>
> Cheers
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>



-- 
Anshum Gupta


Re: [More Like This] Query building

2015-12-29 Thread Alessandro Benedetti
Sure, I will proceed tomorrow with the Jira and the simple patch + tests.

In the meantime let's try to collect some additional feedback.

Cheers

On 29 December 2015 at 12:43, Anshum Gupta  wrote:

> Feel free to create a JIRA and put up a patch if you can.
>
> On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti <
> abenede...@apache.org
> > wrote:
>
> > Hi guys,
> > While I was exploring the way we build the More Like This query, I
> > discovered a part I am not convinced of :
> >
> >
> >
> > Let's see how we build the query :
> > org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
> >
> > 1) we extract the terms from the interesting fields, adding them to a
> map :
> >
> > Map termFreqMap = new HashMap<>();
> >
> > *( we lose the relation field-> term, we don't know anymore where the
> term
> > was coming ! )*
> >
> > org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
> >
> > 2) we build the queue that will contain the query terms, at this point we
> > connect again there terms to some field, but :
> >
> > ...
> >> // go through all the fields and find the largest document frequency
> >> String topField = fieldNames[0];
> >> int docFreq = 0;
> >> for (String fieldName : fieldNames) {
> >>   int freq = ir.docFreq(new Term(fieldName, word));
> >>   topField = (freq > docFreq) ? fieldName : topField;
> >>   docFreq = (freq > docFreq) ? freq : docFreq;
> >> }
> >> ...
> >
> >
> > We identify the topField as the field with the highest document frequency
> > for the term t .
> > Then we build the termQuery :
> >
> > queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));
> >
> > In this way we lose a lot of precision.
> > Not sure why we do that.
> > I would prefer to keep the relation between terms and fields.
> > The MLT query can improve a lot the quality.
> > If i run the MLT on 2 fields : *description* and *facilities* for
> example.
> > It is likely I want to find documents with similar terms in the
> > description and similar terms in the facilities, without mixing up the
> > things and loosing the semantic of the terms.
> >
> > Let me know your opinion,
> >
> > Cheers
> >
> >
> > --
> > --
> >
> > Benedetti Alessandro
> > Visiting card : http://about.me/alessandro_benedetti
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
> >
>
>
>
> --
> Anshum Gupta
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England