[GitHub] [jackrabbit-oak] thomasmueller commented on a diff in pull request #649: OAK-9875 - Prefix query on a Long, analyzed field fails when executed over Elastic

GitBox Thu, 18 Aug 2022 09:03:05 -0700


thomasmueller commented on code in PR #649:
URL: https://github.com/apache/jackrabbit-oak/pull/649#discussion_r949296531



##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticIndexHelper.java:
##########
@@ -207,21 +222,57 @@ private static void mapIndexRules(ElasticIndexDefinition 
indexDefinition, XConte
                     useInSuggest = true;
                 }
             }
-
+            /* TODO: (Add JIRA issue # once available) Potential future 
optimization:
+             *  If a property has propertyIndex=false, we could in principle 
create an ES mapping with index=false to
+             *  avoid creating an index for this property, thus saving time 
and space. However, there is an issue
+             *  with function indexes, which incorrectly rely on the index 
being created even in the case where
+             *  propertyIndex=false. Once this issue is solved, revise the 
logic below to set index=false in the ES
+             *  field if propertyIndex=false.
+             */
             mappingBuilder.startObject(name);
             {
                 // 
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html
                 if (Type.BINARY.equals(type)) {
                     mappingBuilder.field("type", "binary");
+                    // ignore_malformed not allowed for binary fields but also 
not needed as there is no type conversion
+                    if (indexDefinition.isAnalyzed(propertyDefinitions)) {
+                        createNestedAnalyzed(mappingBuilder);

Review Comment:
   I wonder if we could move the code to _after_ the if .. else if ... list, to 
avoid repetition:
   
   ```
                       if (indexDefinition.isAnalyzed(propertyDefinitions)) {
                           createNestedAnalyzed(mappingBuilder);
                       }
   
   ```
   



##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticIndexHelper.java:
##########
@@ -207,21 +222,57 @@ private static void mapIndexRules(ElasticIndexDefinition 
indexDefinition, XConte
                     useInSuggest = true;
                 }
             }
-
+            /* TODO: (Add JIRA issue # once available) Potential future 
optimization:
+             *  If a property has propertyIndex=false, we could in principle 
create an ES mapping with index=false to
+             *  avoid creating an index for this property, thus saving time 
and space. However, there is an issue
+             *  with function indexes, which incorrectly rely on the index 
being created even in the case where
+             *  propertyIndex=false. Once this issue is solved, revise the 
logic below to set index=false in the ES
+             *  field if propertyIndex=false.
+             */
             mappingBuilder.startObject(name);
             {
                 // 
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html
                 if (Type.BINARY.equals(type)) {
                     mappingBuilder.field("type", "binary");
+                    // ignore_malformed not allowed for binary fields but also 
not needed as there is no type conversion
+                    if (indexDefinition.isAnalyzed(propertyDefinitions)) {
+                        createNestedAnalyzed(mappingBuilder);
+                    }
                 } else if (Type.LONG.equals(type)) {
                     mappingBuilder.field("type", "long");
+                    mappingBuilder.field("ignore_malformed", true);
+                    if (indexDefinition.isAnalyzed(propertyDefinitions)) {
+                        createNestedAnalyzed(mappingBuilder);
+                    }
                 } else if (Type.DOUBLE.equals(type) || 
Type.DECIMAL.equals(type)) {
                     mappingBuilder.field("type", "double");
+                    mappingBuilder.field("ignore_malformed", true);
+                    if (indexDefinition.isAnalyzed(propertyDefinitions)) {
+                        createNestedAnalyzed(mappingBuilder);
+                    }
                 } else if (Type.DATE.equals(type)) {
                     mappingBuilder.field("type", "date");
+                    mappingBuilder.field("ignore_malformed", true);
+                    if (indexDefinition.isAnalyzed(propertyDefinitions)) {
+                        createNestedAnalyzed(mappingBuilder);
+                    }
                 } else if (Type.BOOLEAN.equals(type)) {
                     mappingBuilder.field("type", "boolean");
+                    // ES does not support "ignore_malformed" on boolean 
properties, so all values added to the index
+                    // must parse as valid booleans or else the document is 
rejected. We therefore also do not support
+                    // the nested full text field, because to do so we would 
need to provide a value for the top level
+                    // field, which must be a valid value. As in this case we 
cannot set ignore_malformed=true to still
+                    // index as full-text the invalid values, we cannot have a 
full-text index. This breaks compatibility
+                    // with Lucene, which accepts any value, even if it is not 
a valid boolean, but this is unlikely to
+                    // have much impact because setting analyzed=true in a 
boolean property should have no real world

Review Comment:
   Yes, I think we don't have any such cases currently, by running
   
   `    grep -B 2 -A 2 -R --include="*.json" "Boolean" data | grep -B 2 -A 2 
"analyzed" `
   
   But we should probably document this as an incompatibility in 
https://jackrabbit.apache.org/oak/docs/query/elastic.html



##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/ElasticIndexDefinition.java:
##########
@@ -220,6 +222,35 @@ public String getElasticKeyword(String propertyName) {
         return field;
     }
 
+    public String getElasticTextField(String propertyName) {
+        List<PropertyDefinition> propertyDefinitions = 
propertiesByName.get(propertyName);
+        if (propertyDefinitions == null) {
+            return propertyName;
+        }
+
+        Type<?> type = null;
+        for (PropertyDefinition pd : propertyDefinitions) {
+            type = Type.fromTag(pd.getType(), false);
+        }
+
+        if (isAnalyzed(propertyDefinitions)) {
+            // The full text index for String properties is <propertyName>, 
while for non-string properties is part of
+            // the multi-field and is called <propertyName>.text
+            if (Type.BINARY.equals(type) ||

Review Comment:
   Isn't BINARY also indexed as text?



##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticIndexHelper.java:
##########
@@ -207,21 +222,57 @@ private static void mapIndexRules(ElasticIndexDefinition 
indexDefinition, XConte
                     useInSuggest = true;
                 }
             }
-
+            /* TODO: (Add JIRA issue # once available) Potential future 
optimization:
+             *  If a property has propertyIndex=false, we could in principle 
create an ES mapping with index=false to
+             *  avoid creating an index for this property, thus saving time 
and space. However, there is an issue
+             *  with function indexes, which incorrectly rely on the index 
being created even in the case where
+             *  propertyIndex=false. Once this issue is solved, revise the 
logic below to set index=false in the ES
+             *  field if propertyIndex=false.
+             */
             mappingBuilder.startObject(name);
             {
                 // 
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html
                 if (Type.BINARY.equals(type)) {
                     mappingBuilder.field("type", "binary");
+                    // ignore_malformed not allowed for binary fields but also 
not needed as there is no type conversion
+                    if (indexDefinition.isAnalyzed(propertyDefinitions)) {
+                        createNestedAnalyzed(mappingBuilder);
+                    }
                 } else if (Type.LONG.equals(type)) {
                     mappingBuilder.field("type", "long");
+                    mappingBuilder.field("ignore_malformed", true);
+                    if (indexDefinition.isAnalyzed(propertyDefinitions)) {
+                        createNestedAnalyzed(mappingBuilder);
+                    }
                 } else if (Type.DOUBLE.equals(type) || 
Type.DECIMAL.equals(type)) {
                     mappingBuilder.field("type", "double");
+                    mappingBuilder.field("ignore_malformed", true);
+                    if (indexDefinition.isAnalyzed(propertyDefinitions)) {
+                        createNestedAnalyzed(mappingBuilder);
+                    }
                 } else if (Type.DATE.equals(type)) {
                     mappingBuilder.field("type", "date");
+                    mappingBuilder.field("ignore_malformed", true);
+                    if (indexDefinition.isAnalyzed(propertyDefinitions)) {
+                        createNestedAnalyzed(mappingBuilder);
+                    }
                 } else if (Type.BOOLEAN.equals(type)) {
                     mappingBuilder.field("type", "boolean");
+                    // ES does not support "ignore_malformed" on boolean 
properties, so all values added to the index
+                    // must parse as valid booleans or else the document is 
rejected. We therefore also do not support
+                    // the nested full text field, because to do so we would 
need to provide a value for the top level
+                    // field, which must be a valid value. As in this case we 
cannot set ignore_malformed=true to still
+                    // index as full-text the invalid values, we cannot have a 
full-text index. This breaks compatibility
+                    // with Lucene, which accepts any value, even if it is not 
a valid boolean, but this is unlikely to
+                    // have much impact because setting analyzed=true in a 
boolean property should have no real world
+                    // use case.
+                    if (indexDefinition.isAnalyzed(propertyDefinitions)) {
+                        LOG.warn("Property {} of type boolean does not support 
analyzed=true.", name);
+                    }
                 } else {
+                    // OAK-9875 - The ES field with the same name as the 
property should not be used for the full text
+                    // index, it should be used only for the keyword index. 
The full text index is done separately below.
+                    // We preserve the old mapping in case we find some 
problem in production with the new mapping.

Review Comment:
   Not sure if I understand... so we keep this code to provide backward 
compatibility for old oak versions, during the upgrade of the existing indexes 
(to support blue-green deployments), plus for the case that there are problems? 
That sounds reasonable... We can then remove it later if everything goes fine.
   
   If that's our plan, then what adding a new oak issue to remove it later, and 
in this patch add a constant "static final boolean 
BACKWARD_COMPATIBLE_MAPPING_OAK_... = true", so that we can track removal of 
this code? Hiding the code behind "if (BACKWARD_COMPATIBLE_MAPPING_OAK_...)" to 
make it obvious that this code will go away soon...



##########
oak-search-elastic/src/test/java/org/apache/jackrabbit/oak/plugins/index/elastic/ElasticAbstractQueryTest.java:
##########
@@ -140,7 +145,7 @@ protected ContentRepository createRepository() {
         ElasticIndexTracker indexTracker = new 
ElasticIndexTracker(esConnection, getMetricHandler());
         ElasticIndexEditorProvider editorProvider = new 
ElasticIndexEditorProvider(indexTracker, esConnection,
                 new ExtractedTextCache(10 * FileUtils.ONE_MB, 100));
-        ElasticIndexProvider indexProvider = new 
ElasticIndexProvider(indexTracker);
+        ElasticIndexProvider indexProvider = new 
ElasticIndexProvider(indexTracker, separateFullTextSearchESFieldFeature);

Review Comment:
   If we want to add feature flags (now or later), then I think we will end up 
with many of them at some point. So, when we have more than one, we should 
probably have a wrapper class "ElasticConfiguration" that wraps the 
configuration (all feature flags). That way, when we add one more, we don't 
need to change the constructor again - and will never get 10 boolean parameters 
:-)



##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/query/async/ElasticResultRowAsyncIterator.java:
##########
@@ -341,8 +343,16 @@ public void onFailure(Throwable t) {
                 LOG.warn("Error reference for async iterator was previously 
set to {}. It has now been reset to new error {}", error.getMessage(), 
t.getMessage());
             }
 
+            if (t instanceof ElasticsearchException) {
+                ElasticsearchException esException = (ElasticsearchException) 
t;
+                StringBuffer errorMessage = new StringBuffer("Status: " + 
esException.status() + ", Message: " + esException.getMessage());
+                for (ErrorCause c : 
esException.response().error().rootCause()) {

Review Comment:
   The javadoc 
https://javadoc.io/doc/co.elastic.clients/elasticsearch-java/latest/co/elastic/clients/elasticsearch/_types/ElasticsearchException.html
 doesn't say, but could response() return null? For error() the javadoc says 
"Required - API name: error", so it shouldn't be possible (but then, there is 
no guarantee it might change), and root cause might also return null? ("API 
name: root_cause")



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [jackrabbit-oak] thomasmueller commented on a diff in pull request #649: OAK-9875 - Prefix query on a Long, analyzed field fails when executed over Elastic

Reply via email to