Hi,

We have observed that creating an index on a Map field causes Geode to create 
an index entry for every entry in the region containing the Map, regardless of 
whether the Map field contains the key used in the index.
We would expect instead that only entries whose Map field contains the key 
used in the index would have a corresponding index entry. With the current 
behavior, the memory consumed by the index can be much higher than needed, 
depending on the percentage of entries whose Map field contains the key used 
in the index.

---------------------------------------------------
Example:
We have a region with entries whose key type is a String and the value type is 
an object with a field called "field1" of Map type.

We expect to run queries on the region like the following:

SELECT * FROM /example-region1 p WHERE p.field1['mapkey1']=$1

We create a Map index to speed up the above queries:

gfsh> create index --name=mapIndex --expression="r.field1['mapkey1']" --region="/example-region1 r"

We do the following puts:
- Put entry with key="key1" and with value=<Object whose field "field1" is a 
Map that contains the key 'mapkey1'>
- Put entry with key="key2" and with value=<Object whose field "field1" is a 
Map that does not contain the key 'mapkey1'>

The observation is that Geode creates two index entries, one for each region 
entry. For the first entry, the internal indexKey is "key1" and for the second 
one, the internal indexKey is null.

These are the stats shown by gfsh after doing the above puts:

gfsh>list indexes --with-stats=yes
Member Name | Member ID                               | Region Path      | Name     | Type  | Indexed Expression  | From Clause        | Valid Index | Uses | Updates | Update Time | Keys | Values
----------- | --------------------------------------- | ---------------- | -------- | ----- | ------------------- | ------------------ | ----------- | ---- | ------- | ----------- | ---- | ------
server1     | 192.168.0.26(server1:1109606)<v1>:41000 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1    | 1       | 0           | 1    | 1
server2     | 192.168.0.26(server2:1109695)<v2>:41001 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1    | 1       | 0           | 1    | 1
---------------------------------------------------

Is there any reason why Geode would create an index entry for the second entry 
given that the Map field does not contain the key in the Map index?

I have created a draft pull request that changes Geode so that it does not 
create the index entry when the Map field does not contain the key used in the 
index. Only two unit test cases had to be adjusted. Please see: 
https://github.com/apache/geode/pull/6028
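
To make the difference between the two behaviors concrete, here is a minimal 
sketch in plain Java. It is only a toy model and not Geode's index code: the 
class MapKeyIndexSketch, the method buildIndex and the flag skipMissing are 
hypothetical names, and the "index" is simply a map from region key to the 
value found under the indexed Map key.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of a Map-key index. This is NOT Geode's implementation.
public class MapKeyIndexSketch {

    // Returns one index entry per indexed region entry: region key -> value
    // stored under mapKey in that entry's Map field (null if the key is absent).
    // skipMissing=false models the observed behavior: every region entry gets
    // an index entry, even when its Map does not contain mapKey.
    // skipMissing=true models the proposed behavior: such entries are skipped.
    public static Map<String, Object> buildIndex(
            Map<String, Map<String, Object>> region, String mapKey, boolean skipMissing) {
        Map<String, Object> index = new HashMap<>(); // HashMap permits null values
        for (Map.Entry<String, Map<String, Object>> entry : region.entrySet()) {
            Map<String, Object> field1 = entry.getValue();
            if (skipMissing && !field1.containsKey(mapKey)) {
                continue; // proposed: no index entry when the Map lacks mapKey
            }
            index.put(entry.getKey(), field1.get(mapKey));
        }
        return index;
    }

    public static void main(String[] args) {
        // Same scenario as in the example: key1 has 'mapkey1', key2 does not.
        Map<String, Map<String, Object>> region = new HashMap<>();
        region.put("key1", Map.of("mapkey1", "value1"));
        region.put("key2", Map.of("otherkey", "value2"));

        System.out.println("observed: " + buildIndex(region, "mapkey1", false).size());
        System.out.println("proposed: " + buildIndex(region, "mapkey1", true).size());
    }
}
```

In this model the observed behavior yields two index entries for the two puts 
in the example, and the proposed behavior yields one.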

With this change and the same scenario as the one in the example, only one 
index entry is created. The stats shown by gfsh after the change are the 
following:

gfsh>list indexes --with-stats=yes
Member Name | Member ID                               | Region Path      | Name     | Type  | Indexed Expression  | From Clause        | Valid Index | Uses | Updates | Update Time | Keys | Values
----------- | --------------------------------------- | ---------------- | -------- | ----- | ------------------- | ------------------ | ----------- | ---- | ------- | ----------- | ---- | ------
server1     | 192.168.0.26(server1:1102192)<v1>:41000 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 2    | 1       | 0           | 0    | 0
server2     | 192.168.0.26(server2:1102279)<v2>:41001 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 2    | 1       | 0           | 1    | 1


Could someone tell me whether the current behavior is incorrect, or whether I 
am missing something and the change I am proposing would break something else?

Thanks in advance,

/Alberto G.
