Hi Garren, cool, this is a good start.

On the ICU side of things, Russell pointed out that sort keys are a one-way 
trip; i.e., there’s no way to recover the original string from a sort key. For 
the initial pass at Mango I think that’s OK, as we’re reading the indexed 
documents anyway. When we get to views I guess the design will need to store 
the original string in the value so that we can return it as the “key” field in 
the response.
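
To make that concrete, here is a rough sketch (not a design proposal) of
a view row that keeps the sort key in the key and the original string in
the value. The sort_key function is only a stand-in I made up for
illustration; a real implementation would call ICU, and the names
view_entry and response_rows are hypothetical as well.

```python
import unicodedata


def sort_key(text: str) -> bytes:
    """Stand-in for an ICU sort key (assumption: NOT real ICU collation).

    Like a real sort key it is lossy: "JOHN" and "john" map to the same
    bytes, so the original string cannot be recovered from it.
    """
    return unicodedata.normalize("NFKD", text).casefold().encode("utf-8")


def view_entry(emitted_key: str, doc_id: str):
    # The key orders the row; the value keeps the original string so it
    # can be returned as the "key" field in the response.
    return (sort_key(emitted_key), doc_id), emitted_key


def response_rows(pairs):
    """Return (original_key, doc_id) tuples in index order."""
    rows = dict(view_entry(key, doc_id) for key, doc_id in pairs)
    return [(original, doc_id) for (_sk, doc_id), original in sorted(rows.items())]
```

Reading the rows back in key order then yields the original strings
without ever trying to decode the sort key.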

Adam

> On Mar 28, 2019, at 7:01 AM, Garren Smith <gar...@apache.org> wrote:
> 
> Hi everyone,
> 
> 
> I want to start a discussion, with the aim of an RFC, around implementing
> Mango JSON indexes for FoundationDB. Currently Mango indexes are a layer
> above CouchDB map/reduce indexes, but with FoundationDB we can make them
> separate indexes in FoundationDB. This gives us the possibility of being
> able to update the indexes in the same transaction that a document is being
> saved in. Later we can look at adding Mango-specific features like
> covering indexes.
> 
> 
> Let's dive into the data model. Currently a user defines an index like this:
> 
> 
> {
> 
>  name: ‘view-name’ - optional, will be auto-generated if omitted
> 
>  index: {
> 
>    fields: [‘fieldA’, ‘fieldB’]
> 
>  },
> 
>  partial_filter_selector: {} - optional
> 
> }
> 
> 
> For query planning we need to be able to access the list of available
> indexes. So we would have an index_definitions subspace with the following
> content:
> 
> 
> (<fieldname1>, …<rest of fields>) = (<index_name>,
> <partial_filter_selector>)
> 
> 
> Otherwise we could just store the index definitions as:
> 
> (index_name) = ((fields), partial_filter_selector).
> 
> 
> At this stage, I can’t think of a fancy way of storing the index
> definitions so that when we need to select an index for a query there would
> be a fast way to only fetch a subset of the indexes. I think the best
> approach is to fetch them all, as we currently do, and process them. However, we
> can look at caching these index definitions in the application layer, and
> using FoundationDB watches[0] to notify us when a definition has changed so
> we can update the cached definitions.
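
A small sketch of the "fetch them all and process them" approach, using
the second layout above. The dict stands in for the index_definitions
subspace, and the matching rule (every index field must appear in the
selector) is a simplification I am assuming purely for illustration; it
is not Mango's actual planner logic.

```python
# Layout from the thread: (index_name) = ((fields), partial_filter_selector)
# Hypothetical definitions; a plain dict stands in for the subspace.
index_definitions = {
    "by-name-age": (("name", "age"), None),
    "by-city": (("city",), None),
}


def candidate_indexes(selector: dict) -> list:
    """Scan every definition and keep the ones the selector could use.

    Assumed rule: an index is a candidate when all of its fields are
    mentioned in the selector.
    """
    return sorted(
        name
        for name, (fields, _partial_filter) in index_definitions.items()
        if all(field in selector for field in fields)
    )
```

With a cache in the application layer, this scan would run over the
cached definitions rather than a fresh read on every query.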
> 
> 
> Then each index definition will have its own dedicated subspace for the
> actual built index key/values. Keys in this subspace would be the fields
> defined in the index with the doc id at the end of the tuple, e.g. for an
> index with fields name and age, it would be:
> 
> 
> (“john”, 40, “doc-id-1”) = null
> 
> (“mary”, 30, “doc-id-2”) = null
> 
> 
> This follows the same key format that document layer[1] does for its
> indexes. One point to make here is that the doc id is kept in the key part
> so that we can avoid duplicate keys.
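
The key layout above can be sketched with a sorted list of Python
tuples. This is only an in-memory stand-in: Python's tuple ordering is
used here in place of the FoundationDB tuple layer's ordering, and the
names index and docs_with_name are hypothetical.

```python
import bisect

# Stand-in for one index subspace: keys only, values would be null.
index = sorted([
    ("john", 40, "doc-id-1"),
    ("mary", 30, "doc-id-2"),
    ("john", 40, "doc-id-3"),  # same field values, still a distinct key
])


def docs_with_name(name: str) -> list:
    """Range-scan all keys whose first element equals name."""
    start = bisect.bisect_left(index, (name,))
    # name + "\x00" is the smallest string that sorts after name itself.
    end = bisect.bisect_left(index, (name + "\x00",))
    return [key[-1] for key in index[start:end]]
```

Because the doc id is the last element, two documents with identical
field values produce two different keys, and the doc id can be read
straight from the key during the range scan.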
> 
> 
> Then in terms of sorting the keys, current CouchDB uses ICU to sort all
> secondary indexes. We would need to use ICU to sort the indexes for FDB but
> we would have to do it differently. We will not be able to use ICU
> collation operations directly; instead, we are going to have to look at
> using ICU’s sort key[2] to generate a sort key ahead of time. At the same
> time we need to look at creating a binary encoding to capture the way that
> CouchDB currently sorts objects, arrays and numbers. This would most likely
> be a sort of key prefix that we add to each key field along with the sort
> key generated from ICU.
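
One possible shape for that prefix, as a sketch under assumptions: a
one-byte type tag per field so that plain byte comparison reproduces
CouchDB's cross-type order (null, false, true, numbers, strings, with
arrays and objects after that). The tag values and function names are
made up, and the string branch uses casefold() only as a placeholder
where the ICU sort key would go.

```python
import struct

# Hypothetical tags; only their relative order matters.
TAG_NULL, TAG_FALSE, TAG_TRUE = b"\x01", b"\x02", b"\x03"
TAG_NUMBER, TAG_STRING = b"\x04", b"\x05"


def encode_number(n) -> bytes:
    """Encode a number as 8 bytes whose byte order matches numeric order."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", float(n)))
    if bits & (1 << 63):
        bits ^= (1 << 64) - 1  # negative: flip every bit
    else:
        bits |= 1 << 63        # non-negative: set the sign bit
    return struct.pack(">Q", bits)


def encode_field(value) -> bytes:
    """Return tag + payload for one scalar index field."""
    if value is None:
        return TAG_NULL
    if value is False:
        return TAG_FALSE
    if value is True:
        return TAG_TRUE
    if isinstance(value, (int, float)):
        return TAG_NUMBER + encode_number(value)
    if isinstance(value, str):
        # Placeholder: a real implementation would append the ICU sort key.
        return TAG_STRING + value.casefold().encode("utf-8")
    raise TypeError("arrays and objects are left out of this sketch")
```

Sorting mixed values with key=encode_field then groups them by type
first and by value within each type.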
> 
> 
> In terms of keeping mango indexes up to date, we should be able to update
> all existing indexes in the same transaction in which a document is
> updated/created. This means we shouldn’t need a background
> process to keep Mango indexes updated, though I imagine we are going to have to
> look at a background process that updates and builds new indexes on an
> existing database. We will have to do some decent performance testing around
> this to determine the best solution, but looking at document layer they
> seem to recommend updating the indexes in the transaction rather than in a
> background process.
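
A toy illustration of the in-transaction update, with plain Python
containers standing in for the by_id and index subspaces. The names
docs, index, FIELDS and save_doc are hypothetical; in FoundationDB the
three steps inside save_doc would all happen within one transaction so
they commit or fail together.

```python
docs = {}      # stand-in for the by_id subspace: doc_id -> document
index = set()  # stand-in for one index subspace: {(name, age, doc_id)}
FIELDS = ("name", "age")


def index_key(doc_id: str, doc: dict) -> tuple:
    """Build the index key: indexed field values followed by the doc id."""
    return tuple(doc.get(field) for field in FIELDS) + (doc_id,)


def save_doc(doc_id: str, doc: dict) -> None:
    """Write the document and bring its index entry up to date."""
    old = docs.get(doc_id)
    if old is not None:
        index.discard(index_key(doc_id, old))  # 1. clear the stale entry
    docs[doc_id] = doc                         # 2. write the document
    index.add(index_key(doc_id, doc))          # 3. add the new entry
```

Saving the same doc id twice with a different age leaves exactly one
index entry for it, which is the property the transaction has to keep.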
> 
> 
> In the future, we could look at using the value space to store covering
> indexes or materialized views. That way we would not always need to read
> from the by_id when querying with Mango, which would be a nice performance
> improvement.
> 
> 
> 
> Please let me know any thoughts, improvements, suggestions or questions
> around this.
> 
> 
> 
> [0] https://apple.github.io/foundationdb/features.html#watches
> 
> [1] https://github.com/FoundationDB/fdb-document-layer
> 
> [2] http://userguide.icu-project.org/collation/api#TOC-Sort-Key-Features
