Joe McDonnell created IMPALA-12139:
--------------------------------------

             Summary: Add end-to-end tests for HDFS caching with Parquet page indexes, etc.
                 Key: IMPALA-12139
                 URL: https://issues.apache.org/jira/browse/IMPALA-12139
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend, Infrastructure
    Affects Versions: Impala 4.3.0
            Reporter: Joe McDonnell
In a recent bug (IMPALA-12123), we found issues with how HDFS caching interacts with Parquet page indexes. This was diagnosed by creating a table backed by a Parquet file with page indexes and enabling HDFS caching. This is a very useful test scenario, and the same would be true for all other file formats and for the scanner fuzzing tests.

The limiting factor is that HDFS caching requires the ability to lock memory, and the amount of locked memory is limited on Linux for security reasons. By default, the limit is 64KB:
{noformat}
# -l  the maximum size a process may lock into memory
$ ulimit -l
65536{noformat}
The HDFS configuration specifies the maximum locked memory in hdfs-site.xml.tmpl:
{noformat}
<!-- Set the max cached memory to ~64kb. This must be less than ulimit -l -->
<property>
  <name>dfs.datanode.max.locked.memory</name>
  <value>64000</value>
</property>{noformat}
A 64KB limit means that HDFS caching is unreliable or outright impossible with normal-sized Parquet files. We can run "alter table foo set cached in 'testPool'", but the data may not actually get cached.

To raise the limit, we can add the following to /etc/security/limits.conf and start a new user session:
{noformat}
*    hard    memlock    unlimited
*    soft    memlock    unlimited{noformat}
Then we can bump dfs.datanode.max.locked.memory to a much larger value. With the larger limit, caching operations are reliable enough that we could create end-to-end tests.

To get true HDFS caching end-to-end tests, we will need to configure this somewhere, possibly in bin/bootstrap_system.sh, possibly as preexisting configuration on the Jenkins workers.

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
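As a rough sketch, the setup steps described above might look like the following. This is a sketch under assumptions, not a tested recipe: the pool name "testPool" and table "foo" come from the example in the description, the 1GB figure is an arbitrary illustrative value, and the exact restart procedure depends on how the minicluster is managed.

{noformat}
# Raise the memlock limit for all users (requires root; takes effect in a new session).
echo '*    hard    memlock    unlimited' | sudo tee -a /etc/security/limits.conf
echo '*    soft    memlock    unlimited' | sudo tee -a /etc/security/limits.conf

# After logging in again, verify the limit was lifted:
ulimit -l        # expect: unlimited

# Bump dfs.datanode.max.locked.memory in hdfs-site.xml (e.g. to ~1GB),
# then restart the datanodes so the new limit takes effect.

# Create a cache pool, cache the table, and check that bytes were actually cached:
hdfs cacheadmin -addPool testPool
impala-shell -q "alter table foo set cached in 'testPool'"
hdfs cacheadmin -listDirectives -stats{noformat}

Checking "hdfs cacheadmin -listDirectives -stats" (BYTES_CACHED vs. BYTES_NEEDED) is what distinguishes "the directive exists" from "the data is actually cached", which is the flaky part under the 64KB limit.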