[ 
https://issues.apache.org/jira/browse/HIVE-23729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ádám Szita updated HIVE-23729:
------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)

Committed to master, thanks for the review, Olli and Peter!

> LLAP text cache fails when using multiple tables/schemas on the same files
> --------------------------------------------------------------------------
>
>                 Key: HIVE-23729
>                 URL: https://issues.apache.org/jira/browse/HIVE-23729
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Ádám Szita
>            Assignee: Ádám Szita
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When using the text-based cache we will hit exceptions in the following case:
>  * Table A with 3 columns is defined on location X (where we have text-based data files)
>  * Table B with 2 columns is defined on the same location X
>  * The user runs a query on table A, thereby filling the LLAP cache.
>  * If the next query goes against table B, which has a different schema, LLAP will throw an error:
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
>  at org.apache.hadoop.hive.llap.cache.SerDeLowLevelCacheImpl.getCacheDataForOneSlice(SerDeLowLevelCacheImpl.java:411)
>  at org.apache.hadoop.hive.llap.cache.SerDeLowLevelCacheImpl.getFileData(SerDeLowLevelCacheImpl.java:389)
>  at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.readFileWithCache(SerDeEncodedDataReader.java:819)
>  at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.performDataRead(SerDeEncodedDataReader.java:720)
>  at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader$5.run(SerDeEncodedDataReader.java:274)
>  at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader$5.run(SerDeEncodedDataReader.java:271)
> {code}
> This is because the cache lookup is based on the file ID, which in this case 
> is the same for both tables. However, unlike with ORC files, the cached 
> content differs from the file content, because it depends on the schema the 
> user defined: the original text content is encoded into ORC in the cache.
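>
> As a minimal illustration (not the actual SerDeLowLevelCacheImpl code; all 
> names below are made up for the example), keying the cache by file ID alone 
> means the slice cached under table A's 3-column schema is later handed to a 
> reader built for table B's 2-column schema, and the column indexes no longer 
> line up:
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> // Toy cache keyed by file ID only, so tables A and B collide on location X's files.
> class ToyTextCache {
>   private final Map<Long, Object[]> cachedColumnStreams = new HashMap<>();
>
>   // A scan of table A (3 columns) fills the cache with 3 encoded column streams.
>   void put(long fileId, Object[] columnStreams) {
>     cachedColumnStreams.put(fileId, columnStreams);
>   }
>
>   // A later scan of table B (2 columns) sizes its buffers by its own schema,
>   // then copies the cached streams over by column index.
>   Object[] read(long fileId, int readerColumnCount) {
>     Object[] cached = cachedColumnStreams.get(fileId);
>     Object[] result = new Object[readerColumnCount];
>     for (int colIx = 0; colIx < cached.length; colIx++) {
>       result[colIx] = cached[colIx]; // colIx == 2 -> ArrayIndexOutOfBoundsException
>     }
>     return result;
>   }
>
>   public static void main(String[] args) {
>     ToyTextCache cache = new ToyTextCache();
>     cache.put(42L, new Object[3]); // cache filled by a query on table A
>     cache.read(42L, 2);            // query on table B fails on column index 2
>   }
> }
> {code}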
> I think for the text cache case we will need to extend the cache key from 
> being just the simple file ID to something that tracks the schema too. This 
> will result in the *same file content* being cached multiple times (when 
> there are multiple such schemas); however, as we can see, the *cached content 
> itself can be quite different* (e.g. different streams with different 
> encodings), and in return we gain correctness.
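>
> A rough sketch of what such a composite key could look like (illustrative 
> only, not the committed patch; here the schema is captured as a serialized 
> description of the reader's column names and types):
> {code:java}
> import java.util.Objects;
>
> // Illustrative key: same file, different reader schema -> different cache entry.
> final class TextCacheKey {
>   private final Object fileKey;    // the existing file ID / file key
>   private final String schemaDesc; // e.g. column names + types of the reader schema
>
>   TextCacheKey(Object fileKey, String schemaDesc) {
>     this.fileKey = fileKey;
>     this.schemaDesc = schemaDesc;
>   }
>
>   @Override
>   public boolean equals(Object o) {
>     if (this == o) {
>       return true;
>     }
>     if (!(o instanceof TextCacheKey)) {
>       return false;
>     }
>     TextCacheKey other = (TextCacheKey) o;
>     return Objects.equals(fileKey, other.fileKey)
>         && Objects.equals(schemaDesc, other.schemaDesc);
>   }
>
>   @Override
>   public int hashCode() {
>     return Objects.hash(fileKey, schemaDesc);
>   }
> }
> {code}
> With a key like this, table A's and table B's reads of the same file land on 
> separate cache entries, at the cost of caching that file's data once per 
> schema, which is the trade-off described above.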



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
