markap14 commented on a change in pull request #4282: URL: https://github.com/apache/nifi/pull/4282#discussion_r428138798
########## File path: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/queryrecord/FlowFileTable.java ########## @@ -223,12 +225,69 @@ private RelDataType getRelDataType(final DataType fieldType, final JavaTypeFacto case BIGINT: return typeFactory.createJavaType(BigInteger.class); case CHOICE: + final ChoiceDataType choiceDataType = (ChoiceDataType) fieldType; + DataType widestDataType = choiceDataType.getPossibleSubTypes().get(0); + for (final DataType possibleType : choiceDataType.getPossibleSubTypes()) { + if (possibleType == widestDataType) { + continue; + } + if (possibleType.getFieldType().isWiderThan(widestDataType.getFieldType())) { + widestDataType = possibleType; + continue; + } + if (widestDataType.getFieldType().isWiderThan(possibleType.getFieldType())) { + continue; + } + + // Neither is wider than the other. + widestDataType = null; + break; + } + + // If one of the CHOICE data types is the widest, use it. + if (widestDataType != null) { + return getRelDataType(widestDataType, typeFactory); + } + + // None of the data types is strictly the widest. Check if all data types are numeric. + // This would happen, for instance, if the data type is a choice between float and integer. + // If that is the case, we can use a String type for the table schema because all values will fit + // into a String. This will still allow for casting, etc. if the query requires it. + boolean allNumeric = true; + for (final DataType possibleType : choiceDataType.getPossibleSubTypes()) { + if (!isNumeric(possibleType)) { + allNumeric = false; + break; + } + } + + if (allNumeric) { Review comment: @pcgrenier I do see your argument. But I think I disagree. Let's say that we have the following CSV then: ``` name, other markap14, 48 pcgrenier, 19 ``` And the data's schema indicates that "other" is a CHOICE between STRING or INT. Fundamentally, it comes down to one question: in this case, should we allow the user to run a query like `SELECT SUM(other) as total FROM FLOWFILE`. Your argument is yes, because I happen to know that for this particular FlowFile it will succeed. My argument, though, is that we should not allow it. Yes, it would succeed for this FlowFile, but what will happen is that it will fail for other FlowFiles. Other FlowFiles that have the same schema and other FlowFiles for which the data is exactly correct according to the schema. This makes things more confusing because some FlowFiles succeed and others fail. All the while, the schema clearly indicates that the data may be a STRING, and you shouldn't really be able to sum together STRING data. If you do sort out all the number values, it makes sense to use a schema that reflects that. Otherwise, if we allow summing together a CHOICE[int, string] we're not really honoring the schema, in my point of view. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org