gortiz opened a new issue, #10183:
URL: https://github.com/apache/pinot/issues/10183
## Why
The objective of this proposal is to add the ability to include new index
types at runtime in Apache Pinot. This opens the ability of adding third party
indexes, including proprietary indexes. When we talk about index types in this
proposal, we are focused on single field index. Constructions like StarTree
indexes or future indexes that index more than one column are deliberately
ignored.
## The plan
All ideas presented here have been already drafted in this branch [pending
to be created], where we have been exploring the problem in order to find the
solution we propose here. Given the large number of changed in that branch, we
think that it is better to use that branch as a reference, but split the
changes in different PRs. Right now the idea is to create partial PRs in the
following order:
- [ ] Add the `IndexType` without modifying already existing code. Include
few example implementations.
- [ ] Add the `IndexService` without modifying already existing code.
- [ ] Partially implement `IndexType` for each existent index type without
modifying already existing code. Methods that define how to create
`IndexCreator`s, `IndexReader`s, `IndexHandler`s, etc are not going to be
implemented here.
- [ ] Refactor existing code to use `IndexType` whenever `ColumnIdexType` is
being used right now.
- [ ] Transform old configs (including usage of `IndexLoadingConfig` and
`SegmentGeneratorConfig`)
- [ ] Refactor each `IndexCreator` to have a common interface.
- [ ] Remove `IndexCreatorProvider` and `IndexReaderProvider` interfaces and
implementations, moving the code to each `IndexType`. Here classes which
delegate on these providers (like `PhysicalColumnIndexContainer`) will be
modified to call the equivalent methods on IndexType.
- [ ] Refactor classes like `PhysicalColumnIndexContainer` or
`SegmentColumnarIndexCreator` to call just iterate over the `IndexType`s
returned by `IndexService` instead of explicitly ask for each one
## Current blockers
1. Index types are treated as enums, which are, by definition, decided at
compile time. For example, `ColumnIndexType`.
4. Index configuration is specified in `TableConfig`, specially in
`FieldConfig` and `IndexingConfig`. These classes are POJOs with one or more
attributes per index type. Therefore to add a new index it is needed to add
attributes and recompile the class.
5. There is at least one IndexCreator per index type, but each IndexCreator
is different to the other. They do not fulfill a common contract. The same
happen with IndexReader. Ideally IndexCreators and IndexReaders should be
closer to IndexHandlers, which do follow the same contract for each index type.
6. IndexCreators are created using different `XIndexCreatorProvider`, where
X is the index type (like `ForwardIndexCreatorProvider`) but:
1. Each provider has its own methods with its own arguments that derive
from `IndexCreationContext`.
2. These `XIndexCreatorProvider`s do not have a common interface.
3. There is a `IndexCreatorProvider` interface, but it extends all other
`XIndexCreatorProvider`. That pattern goes against the extensibility. For
example, in case a new index type called `Test` was added, we would need to
modify `IndexCreatorProvider` to also extend `TextIndexCreatorProvider` and
implement its method in all implementations, which obviously requires to
recompile the project.
4. Same happens with `IndexReaderProvider`
7. Same happens with `MutableIndexProvider`
5. Indexes are being used when `PlanNode`s are translated into physical
operators. Right now this transformation is totally static. In case we want to
add a new index we need to modify methods like
`FilterPlanNode.constructPhysicalOperator` or
`FilterOperatorUtils.getLeafFilterOperator` in order to add new transformation
rules, which are eagerly executed.
## The IndexService and the IndexType class
This proposal is evolves around the idea that indexes should be obtained
from an `IndexService`. There should be a single instance on every Pinot
component that uses indexes (like Pinot Servers). Optionally this singleton can
be substituted by a bunch of static methods in the `IndexService` class.
This `IndexService` will contain a map from a String _id_ to values of
`IndexType`, which has to be a new interface that completely identifies an
index. The `IndexType` interface will have methods that define the index id,
how to configure the index given a TableConfig and how to obtain its
`IndexCreator`, `IndexReader`, `IndexHandler` and optionally its
`MutableIndex`. These last methods will require information from the
environment. Specially they will require some contextual information (like
statistics for the column) and some configuration. There should be a single
`IndexType` instance for each index type in Pinot (ie one for forward indexes,
one for range indexes, etc).
Pinot will have a new configuration to specify which indexes should be
available and at startup Pinot will look for the corresponding `IndexType` in
the classpath and add it to the `IndexService` singleton. After that, whenever
Pinot requires to use an index Pinot will use the `IndexService` to get the
`IndexType` instance. To look for all indexes in the classpath, the draft
implementation uses an intermediate `IndexPlugin` interface that is discovered
using Java Service Providers.
## The index configuration
As said above, current way to configure is very static, requiring to add new
attributes at compile time to the POJOs we have. At the same time, indexes are
configured in different ways depending on their requirements and when it was
created.
Right now, index configuration is spread across different fields in the
`TableConfig`, which has two key resposibilities:
1. Being the class that is being used to define how the JSON is written by
the users
2. Being the class that devs use to create the readers and creators.
Although it is not impossible to correctly implement both things in the same
class, we think that it would be better to clearly differentiate between the
concrete syntax used by the user to write JSONs and the classes with the final
configuration (with default values resolved, for example). Having both
responsibilities in the same classes makes more difficult to modify the way
indexes are configured by the users without impacting on the code and the other
way around. For example, bloom filters were originally designed to be
configured as a list of columns that have to be indexed but at some point we
wanted to let users to customize the bloom filter properties, so `TableConfig`
contains both `bloomFilterColumns` and `bloomFilterConfigs`. Instead of being
constraint at `TableConfig` level, this duality permeated to all other parts of
the code, so a very common pattern was to first read `bloomFilterColumns` and
then `bloomFilterConfigs` in several places of the code.
Previous `TableConfig` has other problems:
- Documentation is spread between `tableIndexConfig` and `fieldConfigList`.
[Text search support doc
page](https://docs.pinot.apache.org/basics/indexing/text-search-support) says
that `fieldConfigList` is the newest way to configure indexes, but when using
that advanced configuration has to be done in a generic properties field, which
was mapped to a Map<String, String>. This is bad for various reasons:
- It creates a single namespace where properties of different index may
collide.
- It restricts the value of the configuration to strings, so in order to
support advanced configurations devs must parse the string their self and users
must know which syntax is expected for each value. This is even worse when the
parsing has to be done in several different places in the code (for example,
when creating an index and when reading an index).
- In order to add a new index, `TableConfig` source file has to be changed
to add a new attribute in order to be known by Jackson. That is strong blocker
in order to have index types defined at runtime.
- It is impossible to create instances or modify of `TableConfig` in a
programmatic type safe way. Therefore we introduced some abstractions like
`IndexLoadingConfig` or `ColumnIndexCreationInfo` so we can programmatically
modify the config (mainly, but not only, in tests).
- In order to create or read some indexes, we need not only the
configuration provided by the user but also some contextual information that
has to be pre-processed (like for example the max or min values). Therefore we
created classes like `IndexCreationContext`, where we merged the contextual
information with index specific configurations. This is again incompatible with
having index types defined in runtime. For example, we would need to create a
method like `IndexCreationContext.Common.forBloomFilter()` for each new index
type, which would require to modify the source code each time a new index is
added.
`TableConfig` is not in fact very bad designed. The actual problem is that
we are using `TableConfig` for everything. What this proposal does is to keep
`TableConfig` as the concrete syntax specific object that is usually
instantiated by Jackson when a table is deserialized from JSON. In parallel,
each `IndexType` has a generic `C` (which stands for _config_) and defines how
to transform a `TableConfig` into an instance of `C`
Therefore all the dirty code we have to introduce in order to support
historical ways to configure the index is written only once and in the same
method for each `IndexType` instead of being spread across the code. This
should produce code that is easier to read and to maintain. Methods that are
used to instantiate IndexCreators, IndexHandlers or IndexReaders will be
explained bellow, but they do not receive the TableConfig (or a wrapper like
`IndexLoadingConfig`) as an argument. Instead, they receive the already
prepared generic C object they expect.
With the change explained above we have the ability to specify all
`IndexType` methods without depending on the specially crafted interfaces like
`IndexLoadingConfig`, `SegmentGeneratorConfig` or `IndexCreationContext`. But
it doesn't explain how to open the `TableConfig` to be able to store indexes
that are unknown at compile time. The proposal here is to modify a little bit
the idea of configuring all indexes in `FieldConfig`. Instead of list active
indexes in the field `indexTypes`, here we propose to create a new field called
`indexes`. That field is going to be either null or a JSON object whose keys
are the id of the configured indexes and the values are defined by each
`IndexType`. To implement that, we propose the following:
- `FieldConfig.indexes` will be of type `JsonNode`, which is the most
abstract Jackson type. Anything that can be deserialized with Jackson can be
deserialized into a `JsonNode`.
- The class `IndexType` will have a method that returns the unique id of the
index (for example, `ForwardIndexType.getId()` can return "forward").
- The class `IndexType` will also have a method that given a `JsonNode`,
returns an instance of C.
When Pinot requires to transform `FieldConfig.indexes` from `JsonNode` into
proper configuration objects, it will do something like the following:
```
if (fieldConfig.indexes.isNull) {
return null;
}
if (fieldConfig.indexes.isObject) {
Iterator<Map.Entry<String, JsonNode>> it = node.fields();
Map<IndexType, Object> builder = new HashMap<>();
while (it.hasNext()) {
Map.Entry<String, JsonNode> entry = it.next();
String indexId = entry.getKey();
IndexType indexType = IndexService.getInstance().getIndexType(indexId);
if (indexType == null) {
// decide what to do: continue or fail
}
JsonNode indexNode = entry.getValue();
Object deserialized = indexType.deserialize(indexNode);
builder.put(indexType, deserialized);
}
}
```
By doing that, we could be sure that if we do something like
`map.get(BloomIndexType.INSTACE)` we will receive a `BloomFilterConfig`. The
draft includes a intermediate class called `FieldIndexConfigs` which is a
typesafe wrapper on top of a map like the one shown here.
So, for example, this configuration configuration that is currently
supported by Pinot:
```js
{
"tableIndexConfig": {
"rangeIndexColumns": [
"col1"
],
"bloomFilterConfigs": {
"col1": {
"fpp": 0.01,
"maxSizeInBytes": 1000000,
"loadOnHeap": true
}
},
"invertedIndexColumns": [
"col1"
]
},
"fieldConfigList": [
{
"name": “col1”,
"indexTypes": [
“text”
],
"properties": {
"fstType": “lucene”,
"stopWordsInclude": "a,b,c",
"enableQueryCacheForTextIndex": “false”,
"isSegmentPartitioned": “true” // this property is not related to
an specific index
}
}
]
}
```
Could be implemented as:
```js
{
"fieldConfigList": [
{
"name": “col1”,
"indexes": {
"inverted": true, // or {enabled: true}
"range": true, // or {enabled: true}
"text": {
"type": “lucene”,
"enableQueryCacheForTextIndex": false,
"stopWordsInclude": ["a", "b", "c"]
},
"bloom": {
"fpp": 0.01,
"maxSizeInBytes": 1000000,
"loadOnHeap": true
},
},
"properties": {
"isSegmentPartitioned": “true” // this property is not related to
an specific index
}
}
]
}
```
In cases like BloomFilter, when `BloomFilterConfig` is already
serializable/deserializable with Jackson, the method that transform from JSON
to object could be:
```java
BloomFilterConfig deserialize(JsonNode node)
throws IOException {
return JsonUtils.jsonNodeToObject(node, getIndexConfigClass());
}
```
Assuming that Java `null` means that the index is not configured, in case we
need to customize that, for example on inverted indexes when there is no
argument, we can do the same:
```java
public Boolean deserialize(JsonNode node)
throws IOException {
if (node.isBoolean()) {
if (node.booleanValue()) {
return Boolean.TRUE;
} else {
return null;
}
}
throw ...
}
```
In case we want to support different ways to configure the index, for
example RangeIndexType when we would like to either use `true` to use the
default or an object to be able to indicate the version of the range index, we
can do something like:
```java
public RangeIndexConfig deserialize(JsonNode node)
throws IOException {
if (node.isBoolean()) {
if (node.booleanValue()) {
return getDefaultConfig();
} else {
return null;
}
}
return JsonUtils.jsonNodeToObject(node, getIndexConfigClass());
}
```
Most of these methods can also be delegated to Jackson by using customized
deserializers, but we think that having this method here is more explicit and
requires less knowledge of Jackson, which could make it easier to read.
This proposal makes this new way to configure indexes optional to indexes in
Apache Pinot repository but mandatory for indexes outside this repo. Therefore
the proposal must support the _older_ way to configure indexes. This can
partially be done by the already mentioned method that transform from
`TableConfig` to `C`, but we may also need other methods to transform from
`IndexLoadingConfig` and `SegmentGeneratorConfig` to `C`, as they are usually
modified by tests.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]