[jira] [Resolved] (IMPALA-9068) Impala should respect the distinction between the managed warehouse and the external warehouse

2020-01-24 Thread Joe McDonnell (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe McDonnell resolved IMPALA-9068.
---
Fix Version/s: Impala 3.4.0
   Resolution: Fixed

> Impala should respect the distinction between the managed warehouse and the 
> external warehouse
> --
>
> Key: IMPALA-9068
> URL: https://issues.apache.org/jira/browse/IMPALA-9068
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
> Affects Versions: Impala 3.4.0
> Reporter: Joe McDonnell
> Priority: Blocker
> Fix For: Impala 3.4.0
>
>
> Recent versions of Hive 3 distinguish between the directory for managed
> tables and the directory for external tables.
> {code:java}
> WAREHOUSE("metastore.warehouse.dir", "hive.metastore.warehouse.dir",
>     "/user/hive/warehouse",
>     "location of default database for the warehouse"),
> WAREHOUSE_EXTERNAL("metastore.warehouse.external.dir",
>     "hive.metastore.warehouse.external.dir", "",
>     "Default location for external tables created in the warehouse. " +
>     "If not set or null, then the normal warehouse location will be used " +
>     "as the default location."),
> {code}
> With HIVE-22189, Hive strictly enforces this distinction: it no longer
> allows external tables in hive.metastore.warehouse.dir (the managed
> directory). The CREATE TABLE statements are currently translated to CREATE
> EXTERNAL TABLE statements with appropriate table properties, but for this
> to work correctly, hive.metastore.warehouse.external.dir must differ from
> hive.metastore.warehouse.dir. A sensible approach is to set
> hive.metastore.warehouse.external.dir to /test-warehouse and change
> hive.metastore.warehouse.dir to something else, such as
> /test-warehouse-managed.
>
> This will require further changes in our test infrastructure to
> incorporate this distinction. For example, warehouse_dir in
> tests/comparison/cluster/cluster.py needs to handle this appropriately
> (this is needed for testdata/bin/load_nested.py). Some paths in tests
> that use managed tables may also need to change.
>
> hive.metastore.warehouse.external.dir does not exist in Hive 2, so this
> will require some Hive 3 specific logic.
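
The Hive 3 specific logic mentioned in the description could look roughly
like the sketch below. It is only an illustration: the function name
warehouse_dirs, the hive_major_version parameter, and the returned dictionary
are hypothetical and are not taken from Impala's test infrastructure. The two
paths are the ones proposed in the description.

```python
# Hypothetical helper; not part of Impala's actual test infrastructure.
EXTERNAL_WAREHOUSE_DIR = "/test-warehouse"
MANAGED_WAREHOUSE_DIR = "/test-warehouse-managed"


def warehouse_dirs(hive_major_version):
    """Return the warehouse settings to use for a given Hive major version."""
    if hive_major_version >= 3:
        # Hive 3 keeps managed and external tables in separate directories.
        return {
            "hive.metastore.warehouse.dir": MANAGED_WAREHOUSE_DIR,
            "hive.metastore.warehouse.external.dir": EXTERNAL_WAREHOUSE_DIR,
        }
    # Hive 2 has no hive.metastore.warehouse.external.dir, so a single
    # directory holds every table.
    return {"hive.metastore.warehouse.dir": EXTERNAL_WAREHOUSE_DIR}
```

Keeping /test-warehouse as the external directory means existing paths for
external tables would stay unchanged, and only tests that rely on managed
tables would need new paths.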



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IMPALA-9327) Data cache should be able to borrow spill-to-disk space

2020-01-24 Thread Sahil Takiar (Jira)
Sahil Takiar created IMPALA-9327:


 Summary: Data cache should be able to borrow spill-to-disk space
 Key: IMPALA-9327
 URL: https://issues.apache.org/jira/browse/IMPALA-9327
 Project: IMPALA
  Issue Type: Improvement
  Components: Backend
Reporter: Sahil Takiar


Currently, users typically allocate a fixed amount of space for the data cache 
and for spill-to-disk using the configuration options {{--data_cache}} and 
{{--scratch_dirs}}, for example {{--data_cache=/impala/cache:200GB}} and 
{{--scratch_dirs=/impala/scratch:200GB}}.

The problem with this kind of static configuration is that if no queries 
spill to disk, the 200GB reserved for scratch space goes unused. Impala's 
performance and resource utilization would improve if the data cache could 
borrow disk space from the spill-to-disk manager. The space could be returned 
to the spill-to-disk manager when required (e.g. by evicting data from the 
cache).

Users don't have to put a limit on the data_cache / scratch_dirs size (e.g. if 
the 200GB were omitted, Impala would just write files until no disk space is 
left). The problem here is that there is no fairness policy between the data 
cache and scratch space. What will likely happen is that the data cache 
consumes all the disk capacity, leaving none for the scratch space. Impala 
needs logic that allows one side to borrow disk space from the other, but 
still enforces a minimum reserved disk capacity.

One issue with this approach is predictability. When queries start spilling, 
performance may suffer because data will be evicted from the cache. In 
practice, this may not be a big deal, especially since the evicted data will 
be the least recently used data. However, we should still think through the 
tradeoffs between predictability and performance / utilization, and think of 
ways to expose metrics indicating that spill-to-disk is taking space away 
from the data cache.
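
To make the proposal concrete, here is a minimal sketch of one possible 
borrowing policy. It is a toy model, not Impala code: the class 
DiskSpaceArbiter, its methods, and the bytes_evicted_for_spill counter are 
all hypothetical names. It assumes the scratch space has a guaranteed 
reservation that the cache may borrow while it is idle, that scratch reclaims 
it by evicting the least recently used cache entries, and that scratch can 
grow beyond its reservation only into free space.

```python
from collections import OrderedDict


class DiskSpaceArbiter:
    """Toy model: one disk budget shared by a data cache and scratch space."""

    def __init__(self, total_bytes, scratch_reserved_bytes):
        if scratch_reserved_bytes > total_bytes:
            raise ValueError("reservation exceeds total capacity")
        self.total_bytes = total_bytes
        self.scratch_reserved_bytes = scratch_reserved_bytes
        self.scratch_used = 0
        self.cache = OrderedDict()  # key -> size, least recently used first
        self.cache_used = 0
        self.bytes_evicted_for_spill = 0  # metric: cache space lost to spilling

    def _evict_lru(self):
        """Drop the least recently used cache entry and return its size."""
        _, size = self.cache.popitem(last=False)
        self.cache_used -= size
        return size

    def cache_insert(self, key, size):
        """Cache an entry, borrowing any space scratch is not using."""
        if key in self.cache:
            self.cache_used -= self.cache.pop(key)
        limit = self.total_bytes - self.scratch_used
        if size > limit:
            return False
        while self.cache_used + size > limit:
            self._evict_lru()
        self.cache[key] = size
        self.cache_used += size
        return True

    def scratch_alloc(self, size):
        """Allocate scratch space, evicting cache data only up to the reservation."""
        wanted = self.scratch_used + size
        free_limit = self.total_bytes - self.cache_used
        if wanted > max(self.scratch_reserved_bytes, free_limit):
            return False  # beyond the reservation and no free space left
        while self.cache_used > self.total_bytes - wanted:
            self.bytes_evicted_for_spill += self._evict_lru()
        self.scratch_used = wanted
        return True

    def scratch_free(self, size):
        """Return scratch space so the cache can borrow it again."""
        self.scratch_used -= min(size, self.scratch_used)
```

A counter like bytes_evicted_for_spill is one way to surface the 
predictability concern: a steadily rising value would indicate that spilling 
queries are repeatedly taking space away from the cache.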


