[jira] [Comment Edited] (HIVE-11245) LLAP: Fix the LLAP to ORC APIs

Sergey Shelukhin (JIRA) Tue, 11 Aug 2015 18:23:51 -0700

    [ https://issues.apache.org/jira/browse/HIVE-11245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692715#comment-14692715 ]


Sergey Shelukhin edited comment on HIVE-11245 at 8/12/15 1:22 AM:
------------------------------------------------------------------

Most of the work was done in three sub-tasks.
1) Three groups of types were added to the storage API.
* DiskRange; ORC already depends on it, so it was an oversight on master that 
it was not moved to storage-api. It has been moved on the llap branch.
* EncodedColumnBatch and MemoryBuffer. Same as moving VRB and *ColumnVector, 
but for encoded data.
* DataCache, Pool and Allocator APIs (the only import in any of them is 
MemoryBuffer, so they are very generic). The right place to implement a 
format-agnostic cache, allocator, and object pool is Hive, and input formats 
can use these deep inside their core functionality, where Hive has no insight. 
Therefore it makes sense to have connective interfaces.
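
To illustrate what "connective interfaces" means here, below is a minimal 
sketch, not the actual storage-api code. The type names MemoryBuffer, Allocator 
and Pool come from the list above; every method name, HeapAllocator and 
SimplePool are assumptions made up for the example, and DataCache is left out 
for brevity. The point is only that the input format sees nothing but small 
interfaces, while the implementations live on the Hive side.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Hypothetical shapes: the method names below are assumptions, not the real API.
interface MemoryBuffer {
  ByteBuffer getByteBuffer();
}

interface Allocator {
  MemoryBuffer allocate(int size);
  void deallocate(MemoryBuffer buffer);
}

interface Pool<T> {
  T take();
  void offer(T t);
}

// Illustrative heap-backed allocator that only counts outstanding buffers.
final class HeapAllocator implements Allocator {
  private int outstanding;

  @Override public MemoryBuffer allocate(int size) {
    ByteBuffer bytes = ByteBuffer.allocate(size);
    outstanding++;
    return () -> bytes;
  }

  @Override public void deallocate(MemoryBuffer buffer) {
    outstanding--;
  }

  int outstanding() {
    return outstanding;
  }
}

// Illustrative object pool: reuses returned objects, creates new ones on demand.
final class SimplePool<T> implements Pool<T> {
  private final ArrayDeque<T> free = new ArrayDeque<>();
  private final Supplier<T> factory;

  SimplePool(Supplier<T> factory) {
    this.factory = factory;
  }

  @Override public T take() {
    T t = free.poll();
    return t != null ? t : factory.get();
  }

  @Override public void offer(T t) {
    free.push(t);
  }
}
```

A reader written against Allocator and Pool would work unchanged whether Hive 
hands it a heap allocator like this one or a cache-backed one.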

2) The ....orc.encoded package was created with a fully separate path for the 
"record reader", as discussed, although I don't think it was necessary. That 
required making some things in RecordReaderUtils, etc. public, because the Java 
visibility model is stupid.
The package contains 9 files, most of which are very small.
* EncodedOrcFile - equivalent to OrcFile, static factory for Reader.
* Reader - interface, equivalent to orc.Reader, produces EncodedReader.
* EncodedReader - interface, equivalent to RecordReader (although not in 
signatures), for reading encoded data.
* Consumer - interface used in EncodedReader call to return data asynchronously 
(logically, a queue for returned data with "done" and "error" markers).
* OrcBatchKey, OrcCacheKey - simple data structures used as keys when passing 
data around and in the cache.
* ReaderImpl - equivalent to orc.ReaderImpl, the Reader interface 
implementation.
* EncodedReaderImpl - equivalent to RecordReaderImpl (although not in 
signatures), main class that contains the code. Package-private, so it's not 
even visible.
* CacheChunk - part of EncodedReaderImpl that has to be visible for tests, so 
it's in a separate file.
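
The Consumer description above ("a queue for returned data with done and error 
markers") can be sketched roughly as follows. This is an illustration only: 
the method names consumeData, setDone and setError, as well as the 
QueueConsumer class, are assumptions for the example and are not taken from 
the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical callback shape; the method names are assumptions.
interface Consumer<T> {
  void consumeData(T data);
  void setDone();
  void setError(Throwable t);
}

// Queue-backed consumer: data items are followed by one "done" or "error" marker.
final class QueueConsumer<T> implements Consumer<T> {
  private static final Object DONE = new Object();

  private static final class ErrorMarker {
    final Throwable cause;
    ErrorMarker(Throwable cause) {
      this.cause = cause;
    }
  }

  private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();

  @Override public void consumeData(T data) {
    queue.add(data);
  }

  @Override public void setDone() {
    queue.add(DONE);
  }

  @Override public void setError(Throwable t) {
    queue.add(new ErrorMarker(t));
  }

  // Blocks until a marker arrives; returns the data seen before "done",
  // or rethrows the reported error wrapped in an IllegalStateException.
  @SuppressWarnings("unchecked")
  List<T> drain() {
    List<T> result = new ArrayList<>();
    try {
      while (true) {
        Object next = queue.take();
        if (next == DONE) {
          return result;
        }
        if (next instanceof ErrorMarker) {
          throw new IllegalStateException("read failed", ((ErrorMarker) next).cause);
        }
        result.add((T) next);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for data", e);
    }
  }
}
```

The reader thread would call consumeData for each batch and finish with 
setDone or setError, while the caller drains the queue on its own thread.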

3) The remaining item is moving the TreeReader bits that depend on the 
orc.encoded package into the encoded package. Either [~prasanth_j] or I can do 
this.


> LLAP: Fix the LLAP to ORC APIs
> ------------------------------
>
>                 Key: HIVE-11245
>                 URL: https://issues.apache.org/jira/browse/HIVE-11245
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Owen O'Malley
>            Assignee: Sergey Shelukhin
>            Priority: Blocker
>
> Currently the LLAP branch has refactored the ORC code to have different code 
> paths depending on whether the data is coming from the cache or a FileSystem.
> We need to introduce the concept of a DataSource that is responsible for 
> getting the necessary bytes regardless of whether they are coming from a 
> FileSystem, an in-memory cache, or both.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
