Hi Tim,>* Assuming we end up using the ORC C++ library, we probably want to 
manage
>it in the same way that we do Avro by building it externally and then
>linking against it (we use the native-toolchain project for convenience).
>Importing the code seems OK at least for the initial review though.Linking the 
>ORC library outside looks more elegant. However, I think it might loss some 
>oppotunitis for further optimization. Looks like every system has it's own orc 
>reader actually. For example, Presto has a module for ORC: 
>https://github.com/prestodb/presto/tree/master/presto-orc>* It sucks that the 
>ORC library can't handle allocation failures
>gracefully. I need to think about it more. One option is to contribute
>fixes back to the ORC project.
The ORC library will only allocate memory throught orc::MemoryPool::malloc. 
This is the function we implemented. We can throw an exception inside this 
function for allocation failures to stop the reader, and catch the exception 
outside the reader to handle it. This is our current solution, to be able to 
not change any codes of the reader.
>* This is definitely going to conflict with
>https://gerrit.cloudera.org/#/c/8966/ (use reservations for scans). I don't
>know how deterministic the ORC readers memory requirements are, but if it
>assumes it can allocate an arbitrary amount of memory, then it may be
>problematic.
We can track and controll the memory consumption in the implemetation of 
orc::MemoryPool. So I think this is safe.


Finally, since ORC is another columnar file format much like Parquet, we hope 
the orc-reader can be merged into Impala organically. That is, the 
parquet-scanner and orc-scanner can share many logics, instead of using totally 
indenpend code path. For example, the codes of run-length decoding can be 
shared with both scanners. We can also handle the allocation failures inside 
the orc-reader more elegantly.


This patch which merges ORC library codes inside Impala is just a prototype to 
prove that it can work correctly with modules in Impala. We can start from it 
to make it merged organically.


At 2018-01-27 01:02:03, "Tim Armstrong" <[email protected]> wrote: >Thank 
you! > >I had few higher-level questions or thoughts: > >* Assuming we end up 
using the ORC C++ library, we probably want to manage >it in the same way that 
we do Avro by building it externally and then >linking against it (we use the 
native-toolchain project for convenience). >Importing the code seems OK at 
least for the initial review though. >* It sucks that the ORC library can't 
handle allocation failures >gracefully. I need to think about it more. One 
option is to contribute >fixes back to the ORC project. >* This is definitely 
going to conflict with >https://gerrit.cloudera.org/#/c/8966/ (use reservations 
for scans). I don't >know how deterministic the ORC readers memory requirements 
are, but if it >assumes it can allocate an arbitrary amount of memory, then it 
may be >problematic. I'll have to think about this. I don't necessarily think 
you >should have to do the work to switch the ORC scanner to the new paradigm. 
>One would be to merge the ORC reader to a branch post-code-review and then >I 
could do the necessary work to switch it to the new paradigm (since I've 
>already done it for all the other scanners). > >On Fri, Jan 26, 2018 at 3:46 
AM, Quanlong Huang <[email protected]> >wrote: > >> Hi friends, >> >> I'm 
very excited that our ORC-support patch has passed all the tests and >> is 
ready for review! To ease your work, we wrote a brief document about our >> 
solution: https://docs.google.com/document/d/1Lg-MmZIis- >> 
ZbmMf6cD8YJq4x2tM0UXYPyzf0AYqe6Gc >> >> In short, we integrated the mature 
orc-reader (https://github.com/apache/ >> orc/tree/master/c%2B%2B) into Impala. 
As a first step, we only support >> reading primitive types. >> >> Finally, 
here is the review link: https://gerrit.cloudera.org/#/c/9134/ >> >> Hope this 
feature can be accepted! Any feedback is welcome! >> >> Thanks, >> Quanlong 
Huang >> Hulu >> >> >> >>

Reply via email to