clintropolis opened a new issue #7134: overhaul 'druid-orc-extensions' and move from 'contrib' to 'core'
URL: https://github.com/apache/incubator-druid/issues/7134

# Motivation

The existing ORC hadoop indexing extension, while functional, is difficult to use and hard to troubleshoot, and does not provide a consistent experience compared to indexing other nested data formats such as JSON, Avro, and Parquet. I think it would also be nice if this were a 'core' extension to help Druid better tie into the rest of the Apache ecosystem, and since we provide 'core' extensions for Avro and Parquet, this would help round out our 'core' support for popular Hadoop file formats.

# Proposed changes

I propose replacing the existing 'contrib' extension with a new 'core' extension that is modelled after the Parquet and Avro hadoop indexing facilities, with support for `flattenSpec`, and built on the Apache ORC `orc-mapreduce` library instead of Hive's `hive-exec`. Instead of requiring the user to specify the schema in the form of the `typeString` parameter that the existing extension uses, the new extension will auto-detect the schema using the schema encoded in the `TypeDescription` of `OrcStruct`. Structurally it will look a lot like the Parquet extension, with implementations of `JsonProvider` and `ObjectFlatteners.FlattenerMaker<OrcStruct>` to provide flattening for the `InputRowParser` implementation, and a 'converter' utility class to translate the `OrcStruct` and its fields into a map that is ultimately fed to the `InputRowParser` implementation's internal `MapInputRowParser`.

Of note is that this new extension will not be exactly compatible with the current extension, so ingestion specs will likely need to be changed, but the output should still be able to produce data that matches any existing Druid schemas produced by the current extension.
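As a rough illustration of the 'converter' utility idea described above, here is a minimal sketch in plain Java. The names and types are invented for illustration only; the real extension would walk an actual `OrcStruct` guided by its `TypeDescription`, whereas this sketch mocks a struct as a list of name/value fields and recursively converts it to a map suitable for a `MapInputRowParser`:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the 'converter' utility: translate a struct
// (mocked here as name/value fields) into a Map for MapInputRowParser.
// In the real extension the field names and types come from OrcStruct
// plus its TypeDescription; the classes here are invented stand-ins.
public class OrcStructConverterSketch
{
  // A mock field; the real code would read this from OrcStruct.
  public record Field(String name, Object value) {}

  public static Map<String, Object> convertStruct(List<Field> struct)
  {
    Map<String, Object> row = new LinkedHashMap<>();
    for (Field field : struct) {
      Object value = field.value();
      // Recurse into nested structs so that flattenSpec path expressions
      // can address nested fields; primitives pass through unchanged.
      if (value instanceof List<?> nested && !nested.isEmpty() && nested.get(0) instanceof Field) {
        @SuppressWarnings("unchecked")
        List<Field> nestedStruct = (List<Field>) value;
        row.put(field.name(), convertStruct(nestedStruct));
      } else {
        row.put(field.name(), value);
      }
    }
    return row;
  }
}
```

The recursion is what lets a `flattenSpec` address nested fields by path instead of requiring a user-supplied `typeString`.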
The primary incompatibility is that the `typeString` of the current extension allows arbitrary renaming of columns in the ORC file, as only position and type seem to be significant. I'm not certain, but I presume the reason this is allowed is detailed in the Hive issues [HIVE-7189](https://issues.apache.org/jira/browse/HIVE-7189) and [HIVE-4243](https://issues.apache.org/jira/browse/HIVE-4243), which chronicle how Hive would write ORC files without their real column names, instead just using `_col0`, `_col1`, etc. However, `flattenSpec` expressions would be a way to handle this with the new extension, as the fields could be manually extracted from a generic name like `_col0` into the desired column name. If we feel that we really need to continue to support the old behavior, I could investigate possible mechanisms to retain the ability to provide a `typeString` schema and extract column names from it, but I don't feel that it is necessary.

Another incompatibility is related to how the current extension handles `OrcMap` types. It provides a kind of flattening 'magic' for maps of primitives that appear in the row, with names controlled by `mapFieldNameFormat`. Since the new extension would use `flattenSpec`, these names could be replicated with field extraction expressions to preserve existing Druid schemas.

Alternatively, we could just rebrand the existing 'contrib' extension as `druid-hive-orc-extension` and otherwise leave it as is, so existing users who prefer the current behavior can continue unchanged, but having multiple ORC extensions might be confusing.

# Rationale

The ORC format supports nested data, so it should use the `flattenSpec` approach to provide a consistent experience for users. By implementing this extension in the same pattern that is used elsewhere, it should also help reduce maintenance costs on our end.
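As a concrete illustration of the `flattenSpec` approach discussed above, a spec like the following could both discover fields automatically and recover a real column name from a generic `_col0` name (a sketch only; the `userId` name and the surrounding spec context are illustrative):

```json
{
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      {
        "type": "path",
        "name": "userId",
        "expr": "$._col0"
      }
    ]
  }
}
```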
Additionally, using the Apache ORC libraries instead of the Hive libraries "feels" more correct, even if many of these files may be coming from Hive.

# Operational impact

The ORC extension will now be a 'core' extension instead of 'contrib', so it will be bundled in the binary packaging. Ingestion specs will need to be updated to use the new extension. At minimum this requires adjusting the `inputFormat` to use `org.apache.orc.mapreduce.OrcInputFormat` instead of `org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat`, with the remaining work centered on migrating away from `typeString` column renaming (if we don't provide a replacement) and, for users relying on the current extension's `OrcMap` handling facilities, migrating to a `flattenSpec` to extract those properties. If we instead choose to retain and rebrand the existing 'contrib' extension, the operational impact would be limited to changing the extension load list configuration.

# Test plan

A variety of example ORC files will be sourced and used to comprehensively unit test the extension and ensure that flattening and field discovery work as expected. Additionally, testing in a live cluster will be performed to ensure that everything is fully operational.

# Future work

I think there is potentially room to refactor these flattening input row parsers and extract one or more base types, since the `parseBatch` method implementation is always of the form:

```
mapParser.parseBatch(flattener.flatten(objectToFlatten))
```

so future work is looking into ways to simplify and re-use code where possible to reduce maintenance overhead on our end.
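One possible shape for such a shared base type is sketched below. All names are invented for illustration; Druid's actual `InputRowParser` and `ObjectFlattener` interfaces differ in detail, and this sketch only captures the common `parseBatch` form noted above:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a shared base class for the flattening parsers:
// each format-specific subclass supplies only a flattener function, and
// the common parseBatch shape lives here. Names are invented; this is
// not Druid's real InputRowParser API.
public abstract class FlatteningParserSketch<T>
{
  private final Function<T, Map<String, Object>> flattener;

  protected FlatteningParserSketch(Function<T, Map<String, Object>> flattener)
  {
    this.flattener = flattener;
  }

  // Stand-in for delegating to an internal MapInputRowParser; here we
  // simply return the flattened map to keep the sketch self-contained.
  protected List<Map<String, Object>> parseFlattened(Map<String, Object> flattened)
  {
    return List.of(flattened);
  }

  public List<Map<String, Object>> parseBatch(T input)
  {
    // The common form shared by the Avro/Parquet/ORC parsers:
    // mapParser.parseBatch(flattener.flatten(objectToFlatten))
    return parseFlattened(flattener.apply(input));
  }
}
```

With a base type like this, each format extension would only need to provide its `FlattenerMaker` equivalent rather than re-implementing `parseBatch`.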