Re: Ignore a field in the scoring

2015-01-08 Thread Roger de Cordova Farias
Thank you very much

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAJp2533-8TBoyPmfpqj12T_TVb4z%2BrgLKqtuOxRfReajti7WfA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Ignore a field in the scoring

2015-01-07 Thread Masaru Hasegawa
Hi,

I believe it's intended, according to
https://lucene.apache.org/core/4_10_2/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
It says:
--
Note that CollectionStatistics.maxDoc() is used instead of
IndexReader#numDocs() because also TermStatistics.docFreq() is used, and
when the latter is inaccurate, so is CollectionStatistics.maxDoc(), and in
the same direction. In addition, CollectionStatistics.maxDoc() is more
efficient to compute
--

Masaru



Re: Ignore a field in the scoring

2015-01-07 Thread Roger de Cordova Farias
Thank you for your explanation

Do you know if it is a bug or intended behavior?

I don't think deleted (marked as deleted) docs should be used at all



Re: Ignore a field in the scoring

2015-01-06 Thread Masaru Hasegawa
Hi,

An update is a delete followed by an add: instead of modifying the existing
document in place, it deletes it and adds a new document.
Deleted documents are only marked as deleted; they aren't actually removed
from the index until a segment merge.
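
To make the delete-plus-add behaviour concrete, here is a toy Python model. It is only a sketch: ToySegmentIndex and its methods are hypothetical names, and a single flat list stands in for Lucene's segments, which are considerably more involved.

```python
class ToySegmentIndex:
    """Toy model only: one flat list of slots standing in for Lucene segments."""

    def __init__(self):
        self.slots = []  # each slot is [doc_id, source, is_live]

    def index(self, doc_id, source):
        # An update is a delete plus an add: mark any live copy as deleted,
        # then append the new version as a brand-new entry.
        for slot in self.slots:
            if slot[0] == doc_id and slot[2]:
                slot[2] = False
        self.slots.append([doc_id, source, True])

    def max_doc(self):
        # Counts every slot, including ones that are only marked as deleted.
        return len(self.slots)

    def num_docs(self):
        # Counts live documents only.
        return sum(1 for slot in self.slots if slot[2])

    def merge(self):
        # A merge is the point where deleted entries are physically dropped.
        self.slots = [slot for slot in self.slots if slot[2]]


index = ToySegmentIndex()
index.index("a", {"name": "roger"})
index.index("b", {"name": "masaru"})
index.index("a", {"name": "roger", "bookmarked_by": [1]})  # update of "a"
before_merge = (index.max_doc(), index.num_docs())
index.merge()
after_merge = (index.max_doc(), index.num_docs())
print(before_merge, after_merge)
```

In this model the two counts differ by one until the merge runs, which is the same kind of gap that shows up in the maxDocs value of the explain output.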

IDF doesn't exclude those deleted-but-not-yet-removed documents (it still
counts them).
That's why you see a different IDF score (both maxDocs and docFreq were
incremented).
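
The numbers in the explain output from 2015-01-05 can be reproduced with a few lines of Python. This is a sketch that assumes the classic Lucene 4.x DefaultSimilarity formula, idf = 1 + ln(maxDocs / (docFreq + 1)); classic_idf is a hypothetical helper name, not a Lucene or Elasticsearch API.

```python
import math


def classic_idf(doc_freq, max_docs):
    # Assumed formula (Lucene 4.x DefaultSimilarity):
    # idf = 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Values taken from the explain output posted in this thread.
idf_before = classic_idf(201, 738240)
idf_after = classic_idf(202, 738241)
print(round(idf_before, 4), round(idf_after, 4))
```

Under that assumption the results agree with the reported 9.203756 and 9.198819 to within floating-point rounding, so the whole drop comes from docFreq and maxDocs each growing by one.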

Regarding 424 vs. 0: the document had ID 424 (Lucene's internal ID). But when
the document was updated (delete + add), it got the new ID 0 in a new segment.

So, I think it's not possible to keep the score unchanged when you update
documents.
You could run optimize with max_num_segments=1 every time you update documents,
but that's not practical (and until the optimize is done, you'd still see a
different score).
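
For reference, a sketch of how that optimize call could be addressed, assuming the Elasticsearch 1.x `_optimize` endpoint. The host and index name below are placeholders, and optimize_url is a hypothetical helper that only builds the URL; it does not send anything.

```python
from urllib.parse import urlencode


def optimize_url(base_url, index, max_num_segments=1):
    # Builds the URL only. Sending it as a POST would trigger a merge down
    # to `max_num_segments` segments, which is expensive on a large index.
    query = urlencode({"max_num_segments": max_num_segments})
    return f"{base_url.rstrip('/')}/{index}/_optimize?{query}"


# "http://localhost:9200" and "profiles" are made-up example values.
url = optimize_url("http://localhost:9200", "profiles")
print(url)
```

Running this after every single update would force a full merge each time, which is why it is not a practical fix.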


Masaru





Re: Ignore a field in the scoring

2015-01-05 Thread Roger de Cordova Farias
Now I ran the query with explain = true. The results are the following:


*Explain before the update:*

  "details": [
    {
      "value": 5.752348,
      "description": "fieldWeight in 424, product of:",
      "details": [
        {
          "value": 1,
          "description": "tf(freq=1.0), with freq of:",
          "details": [
            {
              "value": 1,
              "description": "termFreq=1.0"
            }
          ]
        },
        {
          "value": 9.203756,
          "description": "idf(docFreq=201, maxDocs=738240)"
        },
        {
          "value": 0.625,
          "description": "fieldNorm(doc=424)"
        }
      ]
    }
  ]



*Update script (scriptLang = groovy, profileId = 1):*

if (ctx._source.bookmarked_by == null) {
    ctx._source.bookmarked_by = [profileId]
} else if (ctx._source.bookmarked_by.contains(profileId)) {
    ctx.op = "none"
} else {
    ctx._source.bookmarked_by += profileId
}



*Explain after the update:*

  "details": [
    {
      "value": 5.749262,
      "description": "fieldWeight in 0, product of:",
      "details": [
        {
          "value": 1,
          "description": "tf(freq=1.0), with freq of:",
          "details": [
            {
              "value": 1,
              "description": "termFreq=1.0"
            }
          ]
        },
        {
          "value": 9.198819,
          "description": "idf(docFreq=202, maxDocs=738241)"
        },
        {
          "value": 0.625,
          "description": "fieldNorm(doc=0)"
        }
      ]
    }
  ]
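
As a sanity check on the two explain outputs, the top-level value appears to be simply the product of the three nested values (tf, idf, fieldNorm). A minimal Python sketch, where field_weight is a hypothetical helper name and the numbers are copied from the explain outputs:

```python
def field_weight(tf, idf, field_norm):
    # "fieldWeight ..., product of:" in the explain output
    return tf * idf * field_norm


weight_before = field_weight(1.0, 9.203756, 0.625)
weight_after = field_weight(1.0, 9.198819, 0.625)
print(weight_before, weight_after)
print("difference:", weight_before - weight_after)
```

Since tf and fieldNorm are identical in both outputs, the idf term is the only one that moved between the two runs.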



*Query used with the explain:*

{
  "query": {
    "query_string": {
      "fields": [
        "name"
      ],
      "query": "roger"
    }
  }
}





The inverse document frequency (idf) changed after adding a new field
that is not used in the query. Also, "fieldWeight in 424" and
"fieldNorm(doc=424)" changed to "fieldWeight in 0" and "fieldNorm(doc=0)"
(I don't know whether that matters).

Can someone help me keep the document's score unchanged after running the
update? Note that the update creates a new field if it was not found
(== null), but this field is not used in the query.


Re: Ignore a field in the scoring

2015-01-05 Thread Roger de Cordova Farias
The added field is an array of integers, but we are not using it in the
query at all.

We are not querying the _all field; it is disabled in our type mapping.

Our query is something like this:

{
  "query": {
    "query_string": {
      "fields": [
        "name"
      ],
      "query": "roger"
    }
  }
}


I ran this query, added a new field called "bookmarked_by" with a numeric
value to the first result, and then ran the same query again. The document
to which I added the new field is no longer the first result.



Re: Ignore a field in the scoring

2014-12-26 Thread Doug Turnbull
Are you querying the _all field? How are you doing your searches?
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-all-field.html

The _all field receives a copy of every field you index, so adding data to
any field could affect scores for queries against _all, regardless of the
source field.

Otherwise, fields are scored independently before being put together by
other queries like boolean queries or dismax. Are you using
boolean/dismax/etc over multiple fields?

-Doug



-- 
Doug Turnbull
Search & Big Data Architect
OpenSource Connections 



Re: Ignore a field in the scoring

2014-12-26 Thread Ivan Brusic
Use the field in a filter and not as part of the query. Is this field free
text?
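
A sketch of what that could look like, assuming the Elasticsearch 1.x query DSL of the time (the `filtered` query) and the "name" and "bookmarked_by" fields mentioned elsewhere in the thread. bookmarked_search is a hypothetical helper that only builds the request body as a Python dict.

```python
import json


def bookmarked_search(text, profile_id):
    # The query_string part is scored; the term filter only includes or
    # excludes documents and does not contribute to the score.
    return {
        "query": {
            "filtered": {
                "query": {
                    "query_string": {"fields": ["name"], "query": text}
                },
                "filter": {"term": {"bookmarked_by": profile_id}},
            }
        }
    }


body = bookmarked_search("roger", 1)
print(json.dumps(body, indent=2))
```

Note that a filter keeps the field itself out of the score, but it does not stop idf from shifting when a document is reindexed.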

Ivan
On Dec 23, 2014 9:12 PM, "Roger de Cordova Farias" <
roger.far...@fontec.inf.br> wrote:

> Hello
>
> Our documents have metadata indexed with them, but we don't want the
> metadata to interfere with the scoring
>
> After a user searches for documents, they can bookmark them (which means we
> add more metadata to the document); then, in the next search with the same
> query, the bookmarked document appears in a lower (worse) position
>
> Is there a way to completely ignore one or more specific fields in the
> scoring of every query, for example at indexing time?
>
> Note that we are not using the metadata field in the query, yet it
> lowers the score of every query
>
> We cannot set the "index" attribute of this field to "no" because we are
> going to use it in other queries
>
