clintropolis opened a new issue #7134: overhaul 'druid-orc-extensions' and move from 'contrib' to 'core'
URL: https://github.com/apache/incubator-druid/issues/7134

# Motivation

The existing ORC hadoop indexing extension, while functional, is difficult to use and hard to troubleshoot, and does not provide a consistent experience compared to indexing other nested data formats such as JSON, Avro, and Parquet. I think it would also be nice if this were a 'core' extension to help Druid better tie into the rest of the Apache ecosystem, and since we provide 'core' extensions for Avro and Parquet, this would help round out our 'core' support for popular Hadoop file formats.

# Proposed changes

I propose replacing the existing 'contrib' extension with a new 'core' extension that is modelled after the Parquet and Avro hadoop indexing facilities, with support for `flattenSpec`, and built on the Apache ORC `orc-mapreduce` library instead of Hive's `hive-exec`. Instead of requiring the user to specify the schema in the form of the `typeString` parameter that the existing extension uses, the new extension will auto-detect the schema using the schema encoded in the `TypeDescription` of `OrcStruct`. Structurally it will look a lot like the Parquet extension, with implementations of `JsonProvider` and `ObjectFlatteners.FlattenerMaker<OrcStruct>` to provide flattening for the `InputRowParser` implementation, and a 'converter' utility class to translate the `OrcStruct` and its fields into a map that is ultimately fed to the `InputRowParser` implementation's internal `MapInputRowParser`.

Of note is that this new extension will not be exactly compatible with the current extension, so ingestion specs will likely need to be changed, but the output should still be able to produce data that matches any existing Druid schemas produced by the current extension.
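As a rough illustration of the 'converter' utility idea described above, here is a minimal sketch in plain Java. The names and types are invented for illustration only; the real extension would walk an actual `OrcStruct` guided by its `TypeDescription`, whereas this sketch mocks a struct as a list of name/value fields and recursively converts it to a map suitable for a `MapInputRowParser`:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the 'converter' utility: translate a struct
// (mocked here as name/value fields) into a Map for MapInputRowParser.
// In the real extension the field names and types come from OrcStruct
// plus its TypeDescription; the classes here are invented stand-ins.
public class OrcStructConverterSketch
{
  // A mock field; the real code would read this from OrcStruct.
  public record Field(String name, Object value) {}

  public static Map<String, Object> convertStruct(List<Field> struct)
  {
    Map<String, Object> row = new LinkedHashMap<>();
    for (Field field : struct) {
      Object value = field.value();
      // Recurse into nested structs so that flattenSpec path expressions
      // can address nested fields; primitives pass through unchanged.
      if (value instanceof List<?> nested && !nested.isEmpty() && nested.get(0) instanceof Field) {
        @SuppressWarnings("unchecked")
        List<Field> nestedStruct = (List<Field>) value;
        row.put(field.name(), convertStruct(nestedStruct));
      } else {
        row.put(field.name(), value);
      }
    }
    return row;
  }
}
```

The recursion is what lets a `flattenSpec` address nested fields by path instead of requiring a user-supplied `typeString`.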
The primary incompatibility is that the `typeString` of the current extension allows arbitrary renaming of columns in the ORC file, as only position and type seem to be significant. I'm not certain, but I presume the reason this is allowed is detailed in the Hive issues [HIVE-7189](https://issues.apache.org/jira/browse/HIVE-7189) and [HIVE-4243](https://issues.apache.org/jira/browse/HIVE-4243), which chronicle how Hive would write ORC files without their real column names, instead just using `_col0`, `_col1`, etc. However, `flattenSpec` expressions would be a way to handle this with the new extension, as the fields could be manually extracted from a generic name like `_col0` into the desired column name. If we feel that we really need to continue to support the old behavior, I could investigate possible mechanisms to retain the ability to provide a `typeString` schema and extract column names from it, but I don't feel that it is necessary.

Another incompatibility is related to how the current extension handles `OrcMap` types. It provides a kind of flattening 'magic' for maps of primitives that appear in the row, with names controlled by `mapFieldNameFormat`. Since the new extension would use `flattenSpec`, these names could be replicated with field extraction expressions to preserve existing Druid schemas.

Alternatively, we could just rebrand the existing 'contrib' extension as `druid-hive-orc-extension` and otherwise leave it as is, so existing users who prefer the current behavior can continue unchanged, but having multiple ORC extensions might be confusing.

# Rationale

The ORC format supports nested data, so it should use the `flattenSpec` approach to provide a consistent experience for users. By implementing this extension in the same pattern that is used elsewhere, it should also help reduce maintenance costs on our end.
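As a concrete illustration of the `flattenSpec` approach discussed above, a spec like the following could both discover fields automatically and recover a real column name from a generic `_col0` name (a sketch only; the `userId` name and the surrounding spec context are illustrative):

```json
{
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      {
        "type": "path",
        "name": "userId",
        "expr": "$._col0"
      }
    ]
  }
}
```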
Additionally, using the Apache ORC libraries instead of the Hive libraries "feels" more correct, even if many of these files may be coming from Hive.

# Operational impact

The ORC extension will now be a 'core' extension instead of 'contrib', so it will be bundled in the binary packaging. Ingestion specs will need to be updated to use the new extension. At minimum this requires adjusting the `inputFormat` to use `org.apache.orc.mapreduce.OrcInputFormat` instead of `org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat`, with the remaining work centered on migrating away from `typeString` column renaming (if we don't provide a replacement) and, for users relying on the current extension's `OrcMap` handling facilities, migrating to a `flattenSpec` to extract those properties. If we instead choose to retain and rebrand the existing 'contrib' extension, the operational impact would be limited to changing the extension load list configuration.

# Test plan

A variety of example ORC files will be sourced and used to comprehensively unit test the extension and ensure that flattening and field discovery work as expected. Additionally, testing in a live cluster will be performed to ensure that everything is fully operational.

# Future work

I think there is potentially room to refactor these flattening input row parsers and extract one or more base types, since the `parseBatch` method implementation is always of the form:

```
mapParser.parseBatch(flattener.flatten(objectToFlatten))
```

so future work is looking into ways to simplify and re-use code where possible to reduce maintenance overhead on our end.
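One possible shape for such a shared base type is sketched below. All names are invented for illustration; Druid's actual `InputRowParser` and `ObjectFlattener` interfaces differ in detail, and this sketch only captures the common `parseBatch` form noted above:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a shared base class for the flattening parsers:
// each format-specific subclass supplies only a flattener function, and
// the common parseBatch shape lives here. Names are invented; this is
// not Druid's real InputRowParser API.
public abstract class FlatteningParserSketch<T>
{
  private final Function<T, Map<String, Object>> flattener;

  protected FlatteningParserSketch(Function<T, Map<String, Object>> flattener)
  {
    this.flattener = flattener;
  }

  // Stand-in for delegating to an internal MapInputRowParser; here we
  // simply return the flattened map to keep the sketch self-contained.
  protected List<Map<String, Object>> parseFlattened(Map<String, Object> flattened)
  {
    return List.of(flattened);
  }

  public List<Map<String, Object>> parseBatch(T input)
  {
    // The common form shared by the Avro/Parquet/ORC parsers:
    // mapParser.parseBatch(flattener.flatten(objectToFlatten))
    return parseFlattened(flattener.apply(input));
  }
}
```

With a base type like this, each format extension would only need to provide its `FlattenerMaker` equivalent rather than re-implementing `parseBatch`.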