Joe McDonnell created IMPALA-12139:
--------------------------------------

             Summary: Add end-to-end tests for HDFS caching with Parquet page indexes, etc.
                 Key: IMPALA-12139
                 URL: https://issues.apache.org/jira/browse/IMPALA-12139
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend, Infrastructure
    Affects Versions: Impala 4.3.0
            Reporter: Joe McDonnell
In a recent bug (IMPALA-12123), we found issues with how HDFS caching interacts with Parquet page indexes. This was diagnosed by creating a table backed by a Parquet file with page indexes and enabling HDFS caching. This is a very useful test scenario, and the same would be true for all other file formats and for the scanner fuzzing tests.

The limiting factor is that HDFS caching requires the ability to lock memory, and the amount of locked memory is limited on Linux for security reasons. By default, the limit is 64KB:
{noformat}
# -l  the maximum size a process may lock into memory
$ ulimit -l
65536{noformat}
The HDFS configuration specifies the maximum locked memory in hdfs-site.xml.tmpl:
{noformat}
<!-- Set the max cached memory to ~64kb. This must be less than ulimit -l -->
<property>
  <name>dfs.datanode.max.locked.memory</name>
  <value>64000</value>
</property>{noformat}
A 64KB limit means that HDFS caching is unreliable or outright impossible with normal-sized Parquet files. We can run "alter table foo set cached in 'testPool'", but the data may not actually get cached.

To raise the limit, we can add the following to /etc/security/limits.conf and start a new user session:
{noformat}
*    hard    memlock    unlimited
*    soft    memlock    unlimited{noformat}
Then we can bump dfs.datanode.max.locked.memory to a much larger value. With the larger limit, caching operations are reliable enough that we could create end-to-end tests.

To get true HDFS caching end-to-end tests, we will need to configure this somewhere, possibly in bin/bootstrap_system.sh, possibly as preexisting configuration on the Jenkins workers.

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
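As a rough sketch, the setup steps described above might look like the following. This is a sketch under assumptions, not a tested recipe: the pool name "testPool" and table "foo" come from the example in the description, the 1GB figure is an arbitrary illustrative value, and the exact restart procedure depends on how the minicluster is managed.

{noformat}
# Raise the memlock limit for all users (requires root; takes effect in a new session).
echo '*    hard    memlock    unlimited' | sudo tee -a /etc/security/limits.conf
echo '*    soft    memlock    unlimited' | sudo tee -a /etc/security/limits.conf

# After logging in again, verify the limit was lifted:
ulimit -l        # expect: unlimited

# Bump dfs.datanode.max.locked.memory in hdfs-site.xml (e.g. to ~1GB),
# then restart the datanodes so the new limit takes effect.

# Create a cache pool, cache the table, and check that bytes were actually cached:
hdfs cacheadmin -addPool testPool
impala-shell -q "alter table foo set cached in 'testPool'"
hdfs cacheadmin -listDirectives -stats{noformat}

Checking "hdfs cacheadmin -listDirectives -stats" (BYTES_CACHED vs. BYTES_NEEDED) is what distinguishes "the directive exists" from "the data is actually cached", which is the flaky part under the 64KB limit.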