Odg: Question about INDEX_THRESHOLD_SIZE

2022-06-07 Thread Mario Kevo

The PR and questions surrounding it, aren’t the cause of what you are seeing 
now.



I’ll jump into debugging this a bit when I get the chance – haven’t had a 
chance to sit down with a debugger just yet.



Regards,

-Jason




From: Mario Kevo 
Date: Monday, March 14, 2022 at 8:17 AM
To: dev@geode.apache.org 
Subject: Odg: Question about INDEX_THRESHOLD_SIZE
Hi,

Regarding Anil's answer:
The test will pass as we have set INDEX_THRESHOLD_SIZE to 1000 (that is how 
many entries are in the region). If you deleted that line in startServer it 
will fail asit used default value of indexThresholdSize(100).

Regarding Jason's answers:
Only change what my PR introduced to get availability to change 
INDEX_THRESHOLD_SIZE by the user as before it is hardcoded to 100, and nothing 
can change it.
The intermediate result will have all entries on which index can be applied and 
if there are more entries that have that attribute on which index is created it 
will store them all in the results(if INDEX_THRESHOLD_SIZE is changed). By 
default, it has a limit of 100 entries so it can happen that an entry that 
matches the condition will not be in the results.
Regarding this applyCondition in CompactRangeIndex.addToResultsFromEntries, it 
is never applied as iterOps is always null, as it is set in 
AbstractGroupOrRangeJunction.auxFilterEvaluate:

filterResults = filter.filterEvaluate(context,
!isConditioningNeeded ? intermediateResults : null, completeExpansion,
null/*
* Asif * Asif :The iter operands passed are null, as a not null value can 
exists only
* if there exists a single Filter operand in original GroupJunction
*/, indpndntItr, _operator == LITERAL_and, isConditioningNeeded,
false /* do not evaluate projection */);
​BR,
Mario

Šalje: Jason Huynh 
Poslano: 11. ožujka 2022. 22:06
Prima: dev@geode.apache.org 
Predmet: Re: Question about INDEX_THRESHOLD_SIZE

As an fyi, in the past we disabled applying limits at the index level for range 
indexes.

I’m surprised in this case that we would add all the entries to the 
intermediate results instead of applying the filter first and checking the 
condition before adding to the intermediate results..

I would have thought it would have to apply the condition as seen in 
CompactRangeIndex.addToResultsFromEntries :

  if (ok && runtimeItr != null && iterOps != null) {
  ok = QueryUtils.applyCondition(iterOps, context);
}

I haven’t walked this query through the code, perhaps it’s hitting a different 
index type (I’d think a map index but that probably is backed by 
CompactRangeIndexes for each key…)



From: Jason Huynh 
Date: Friday, March 11, 2022 at 12:47 PM
To: dev@geode.apache.org 
Subject: Re: Question about INDEX_THRESHOLD_SIZE
I think 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F7010data=04%7C01%7Cjhuynh%40vmware.com%7Cfc3f9301ac074a3e694808da06b7%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637829714375501933%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=%2FZRbcTiJzvRXrM9TXN3hTcsS%2BXr51ZAXfiuRu4FNpxo%3Dreserved=0
 may have changed what that property represented.  I believe it was some 
arbitrary threshold to abort using index look ups (if the intermediate results 
were small, it was deemed faster to just iterate through and not do a lookup – 
at least from my interpretation of the code)
It looks like with the change, it now munges it with limit.. so now limit is 
applied to that value.. gfsh happens to always pass in a limit too, so there is 
possibly additional confusion

From the diff there is also one spot where a limit != -1 had not been added..  
In CompactRangeIndex line 489:

  if (limit < indexThresholdSize) {
limit = indexThresholdSize;
  }

This might be affecting the usage of limit at the index level?


From: Anilkumar Gingade 
Date: Friday, March 11, 2022 at 12:11 PM
To: dev@geode.apache.org 
Subject: Re: Question about INDEX_THRESHOLD_SIZE
Mario,

There is similar test/example added by you in QueryWithRangeIndexDUnitTest. 
testQueryWithWildcardAndIndexOnAttributeFromHashMap()
When I run that test (on develop); I see the results as expected:
*
Command result for :
Result  : true
Limit   : 100
Rows: 1
Query Trace : Query Executed in 85.1964 ms; indexesUsed(1):IdIndex(Results: 
1)

Are you running your test with any additional change like as you are saying :
>> I was working on allowing INDEX_THRESHOLD_SIZE System property to override 
>> CompiledValue.RESULT_LIMIT.

If so , you need to look at the change and see why its impacting that way.
If not, please let me know what change can be made in that test to reproduce 
the issue you are seeing; that will help to debug/analyze the issue.

-Anil.




On 3/11/22, 12:18 AM, "Mario Kevo"  wrote:

Hi,

It works without an i

Odg: Question about INDEX_THRESHOLD_SIZE

2022-03-14 Thread Mario Kevo
Hi,

Regarding Anil's answer:
The test will pass as we have set INDEX_THRESHOLD_SIZE to 1000 (that is how 
many entries are in the region). If you deleted that line in startServer it 
will fail asit used default value of indexThresholdSize(100).

Regarding Jason's answers:
Only change what my PR introduced to get availability to change 
INDEX_THRESHOLD_SIZE by the user as before it is hardcoded to 100, and nothing 
can change it.
The intermediate result will have all entries on which index can be applied and 
if there are more entries that have that attribute on which index is created it 
will store them all in the results(if INDEX_THRESHOLD_SIZE is changed). By 
default, it has a limit of 100 entries so it can happen that an entry that 
matches the condition will not be in the results.
Regarding this applyCondition in CompactRangeIndex.addToResultsFromEntries, it 
is never applied as iterOps is always null, as it is set in 
AbstractGroupOrRangeJunction.auxFilterEvaluate:

filterResults = filter.filterEvaluate(context,
!isConditioningNeeded ? intermediateResults : null, completeExpansion,
null/*
* Asif * Asif :The iter operands passed are null, as a not null value can 
exists only
* if there exists a single Filter operand in original GroupJunction
*/, indpndntItr, _operator == LITERAL_and, isConditioningNeeded,
false /* do not evaluate projection */);
​BR,
Mario

Šalje: Jason Huynh 
Poslano: 11. ožujka 2022. 22:06
Prima: dev@geode.apache.org 
Predmet: Re: Question about INDEX_THRESHOLD_SIZE

As an fyi, in the past we disabled applying limits at the index level for range 
indexes.

I’m surprised in this case that we would add all the entries to the 
intermediate results instead of applying the filter first and checking the 
condition before adding to the intermediate results..

I would have thought it would have to apply the condition as seen in 
CompactRangeIndex.addToResultsFromEntries :

  if (ok && runtimeItr != null && iterOps != null) {
  ok = QueryUtils.applyCondition(iterOps, context);
}

I haven’t walked this query through the code, perhaps it’s hitting a different 
index type (I’d think a map index but that probably is backed by 
CompactRangeIndexes for each key…)



From: Jason Huynh 
Date: Friday, March 11, 2022 at 12:47 PM
To: dev@geode.apache.org 
Subject: Re: Question about INDEX_THRESHOLD_SIZE
I think https://github.com/apache/geode/pull/7010 may have changed what that 
property represented.  I believe it was some arbitrary threshold to abort using 
index look ups (if the intermediate results were small, it was deemed faster to 
just iterate through and not do a lookup – at least from my interpretation of 
the code)
It looks like with the change, it now munges it with limit.. so now limit is 
applied to that value.. gfsh happens to always pass in a limit too, so there is 
possibly additional confusion

From the diff there is also one spot where a limit != -1 had not been added..  
In CompactRangeIndex line 489:

  if (limit < indexThresholdSize) {
limit = indexThresholdSize;
  }

This might be affecting the usage of limit at the index level?


From: Anilkumar Gingade 
Date: Friday, March 11, 2022 at 12:11 PM
To: dev@geode.apache.org 
Subject: Re: Question about INDEX_THRESHOLD_SIZE
Mario,

There is similar test/example added by you in QueryWithRangeIndexDUnitTest. 
testQueryWithWildcardAndIndexOnAttributeFromHashMap()
When I run that test (on develop); I see the results as expected:
*
Command result for :
Result  : true
Limit   : 100
Rows: 1
Query Trace : Query Executed in 85.1964 ms; indexesUsed(1):IdIndex(Results: 
1)

Are you running your test with any additional change like as you are saying :
>> I was working on allowing INDEX_THRESHOLD_SIZE System property to override 
>> CompiledValue.RESULT_LIMIT.

If so , you need to look at the change and see why its impacting that way.
If not, please let me know what change can be made in that test to reproduce 
the issue you are seeing; that will help to debug/analyze the issue.

-Anil.




On 3/11/22, 12:18 AM, "Mario Kevo"  wrote:

Hi,

It works without an index but it doesn't work with an index.
When I revert changes, it takes INDEX_THRESHOLD_SIZE default value(100). 
And if the entry that matches the condition is not in that resultset it will 
not be printed.
Without index:
​gfsh>query --query="SELECT e.key, e.value from 
/example-region.entrySet e where e.value.positions['SUN'] like 'someth%'"
Result  : true
Limit   : 100
Rows: 1
Query Trace : Query Executed in 11.502283 ms; indexesUsed(0)

key | value
--- | 

300 | 

Odg: Question about INDEX_THRESHOLD_SIZE

2022-03-11 Thread Mario Kevo
Hi,

It works without an index but it doesn't work with an index.
When I revert changes, it takes INDEX_THRESHOLD_SIZE default value(100). And if 
the entry that matches the condition is not in that resultset it will not be 
printed.
Without index:
​gfsh>query --query="SELECT e.key, e.value from /example-region.entrySet 
e where e.value.positions['SUN'] like 'someth%'"
Result  : true
Limit   : 100
Rows: 1
Query Trace : Query Executed in 11.502283 ms; indexesUsed(0)

key | value
--- | 

300 | 
{"ID":300,"indexKey":0,"pkid":"300","shortID":null,"position1":{"mktValue":1945.0,"secId":"ORCL","secIdIndexed":"ORCL","secType":null,"sharesOutstanding":1944000.0,"underlyer":null,"pid":1944,"portfolioId":300,..
​With index:
gfsh>query --query="SELECT e.key, e.value from /example-region.entrySet 
e where e.value.positions['SUN'] like 'someth%'"
Result  : true
Limit   : 100
Rows: 0
Query Trace : Query Executed in 8.784831 ms; indexesUsed(1):index1(Results: 100)
​BR,
Mario

Šalje: Anilkumar Gingade 
Poslano: 10. ožujka 2022. 23:16
Prima: dev@geode.apache.org 
Predmet: Re: Question about INDEX_THRESHOLD_SIZE

Mario,

There are few changes happened around this area as part of GEODE-9632 fix; can 
you please revert that change and see if the query works both with and without 
index.
Looking at the code; it seems to restrict the number index look up that needs 
to be performed; certain latency/throughput sensitive queries that or not 
expecting exact result may use this (my guess) but by default it should not be 
resulting in unexpected results.

-Anil.


On 3/10/22, 6:50 AM, "Mario Kevo"  wrote:

Hi geode-dev,

Some time ago I was working on allowing INDEX_THRESHOLD_SIZE System 
property to override CompiledValue.RESULT_LIMIT.
After this change, adding this attribute will take into a count if you set 
it.
But I need some clarification of this INDEX_THRESHOLD_SIZE attribute. Why 
is this set by default to 100?
The main problem with this attribute is that if you want to get the correct 
result, you need to know how many entries will be in the region while starting 
servers and set it on that value or higher. Sometimes it is too hard to know 
how many entries will be in the region, so maybe better will be to set it by 
default to some higher number, something like Integer.MAX_VALUE.

Where this attribute is used?
It is used to get index results while doing queries.

What is the problem?
If we have INDEX_THRESHOLD_SIZE set to 500, and we have 1k entries it can 
happen that while doing a query it will get only 500 entries and where clause 
cannot be fulfilled and we got no results.
Let's see it by an example!

We have only one entry that matches the condition from the query, 
INDEX_THRESHOLD_SIZE set to 500, and 1k entries in the region.
If we run the query without an index we got the result.
gfsh>query --query="SELECT e.key, e.value from 
/example-region.entrySet e where e.value.positions['SUN'] like 'someth%'"
Result  : true
Limit   : 100
Rows: 1
Query Trace : Query Executed in 10.750238 ms; indexesUsed(0)

key | value
--- | 

700 | 
{"ID":700,"indexKey":0,"pkid":"700","shortID":null,"position1":{"mktValue":1945.0,"secId":"ORCL","secIdIndexed":"ORCL","secType":null,"sharesOutstanding":1944000.0,"underlyer":null,"pid":1944,"portfolioId":700,..
​If we create an index and then run again this query there is no result.
gfsh>query --query="SELECT e.key, e.value from 
/example-region.entrySet e where e.value.positions['SUN'] like 'someth%'"
Result  : true
Limit   : 100
Rows: 0
Query Trace : Query Executed in 22.079016 ms; 
indexesUsed(1):index1(Results: 500)
​This happened because we have no luck getting that entry that matches the 
condition in the intermediate results for the index.
So the questions are:
What if more entries enter the region that will make the index return more 
entries than this threshold sets? Then we're again in jeopardy that the query 
condition will not match.
Why is this attribute set by default to 100?
Can we change it to the Integer.MAX_VALUE by default to be sure that we 
have the correct result? What are the consequences?

BR,
Mario