Hi Tim,>* Assuming we end up using the ORC C++ library, we probably want to manage >it in the same way that we do Avro by building it externally and then >linking against it (we use the native-toolchain project for convenience). >Importing the code seems OK at least for the initial review though.Linking the >ORC library outside looks more elegant. However, I think it might loss some >oppotunitis for further optimization. Looks like every system has it's own orc >reader actually. For example, Presto has a module for ORC: >https://github.com/prestodb/presto/tree/master/presto-orc>* It sucks that the >ORC library can't handle allocation failures >gracefully. I need to think about it more. One option is to contribute >fixes back to the ORC project. The ORC library will only allocate memory throught orc::MemoryPool::malloc. This is the function we implemented. We can throw an exception inside this function for allocation failures to stop the reader, and catch the exception outside the reader to handle it. This is our current solution, to be able to not change any codes of the reader. >* This is definitely going to conflict with >https://gerrit.cloudera.org/#/c/8966/ (use reservations for scans). I don't >know how deterministic the ORC readers memory requirements are, but if it >assumes it can allocate an arbitrary amount of memory, then it may be >problematic. We can track and controll the memory consumption in the implemetation of orc::MemoryPool. So I think this is safe.
Finally, since ORC is another columnar file format much like Parquet, we hope the orc-reader can be merged into Impala organically. That is, the parquet-scanner and orc-scanner can share many logics, instead of using totally indenpend code path. For example, the codes of run-length decoding can be shared with both scanners. We can also handle the allocation failures inside the orc-reader more elegantly. This patch which merges ORC library codes inside Impala is just a prototype to prove that it can work correctly with modules in Impala. We can start from it to make it merged organically. At 2018-01-27 01:02:03, "Tim Armstrong" <[email protected]> wrote: >Thank you! > >I had few higher-level questions or thoughts: > >* Assuming we end up using the ORC C++ library, we probably want to manage >it in the same way that we do Avro by building it externally and then >linking against it (we use the native-toolchain project for convenience). >Importing the code seems OK at least for the initial review though. >* It sucks that the ORC library can't handle allocation failures >gracefully. I need to think about it more. One option is to contribute >fixes back to the ORC project. >* This is definitely going to conflict with >https://gerrit.cloudera.org/#/c/8966/ (use reservations for scans). I don't >know how deterministic the ORC readers memory requirements are, but if it >assumes it can allocate an arbitrary amount of memory, then it may be >problematic. I'll have to think about this. I don't necessarily think you >should have to do the work to switch the ORC scanner to the new paradigm. >One would be to merge the ORC reader to a branch post-code-review and then >I could do the necessary work to switch it to the new paradigm (since I've >already done it for all the other scanners). > >On Fri, Jan 26, 2018 at 3:46 AM, Quanlong Huang <[email protected]> >wrote: > >> Hi friends, >> >> I'm very excited that our ORC-support patch has passed all the tests and >> is ready for review! To ease your work, we wrote a brief document about our >> solution: https://docs.google.com/document/d/1Lg-MmZIis- >> ZbmMf6cD8YJq4x2tM0UXYPyzf0AYqe6Gc >> >> In short, we integrated the mature orc-reader (https://github.com/apache/ >> orc/tree/master/c%2B%2B) into Impala. As a first step, we only support >> reading primitive types. >> >> Finally, here is the review link: https://gerrit.cloudera.org/#/c/9134/ >> >> Hope this feature can be accepted! Any feedback is welcome! >> >> Thanks, >> Quanlong Huang >> Hulu >> >> >> >>
