[jira] [Commented] (IMPALA-7265) Cache remote file handles
[ https://issues.apache.org/jira/browse/IMPALA-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775404#comment-16775404 ]

ASF subversion and git services commented on IMPALA-7265:
---------------------------------------------------------

Commit dce82e4e018d1944ff19bb6f87139b51c1b0287e in impala's branch refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=dce82e4 ]

IMPALA-8178: Disable file handle cache for HDFS erasure coded files

Testing on an erasure coded minicluster has revealed that each file handle
for an erasure coded file uses about 3MB of native memory. This shows up as
"java.nio:type=BufferPool,name=direct" in the /jmx endpoint (here showing
the output when 608 handles are open):

  {
    "name": "java.nio:type=BufferPool,name=direct",
    "modelerType": "sun.management.ManagementFactoryHelper$1",
    "Name": "direct",
    "TotalCapacity": 1921048960,
    "MemoryUsed": 1921048961,
    "Count": 633,
    "ObjectName": "java.nio:type=BufferPool,name=direct"
  }

The memory is not released or reduced by a call to unbuffer(), so these
file handles are not suitable for long-term caching. HDFS-14308 tracks the
implementation of unbuffer() for DFSStripedInputStream.

This issue showed up when remote file handle caching was enabled in
IMPALA-7265, as erasure coded files are always scheduled to be remote
(IMPALA-7019).

This disables file handle caching for erasure coded files, which requires
plumbing through the information about which ScanRanges are accessing
erasure coded files. With this change, core tests pass on an erasure coded
system.

Change-Id: I8c761e08aacc952de0033a4c91e07f15c8ec96da
Reviewed-on: http://gerrit.cloudera.org:8080/12552
Reviewed-by: Joe McDonnell
Tested-by: Impala Public Jenkins

> Cache remote file handles
> -------------------------
>
>                 Key: IMPALA-7265
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7265
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 3.1.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Critical
>             Fix For: Impala 3.2.0
>
> The file handle cache currently does not allow caching remote file handles.
> This means that clusters that have a lot of remote reads can suffer from
> overloading the NameNode. Impala should be able to cache remote file handles.
> There are some open questions about remote file handles and whether they
> behave differently from local file handles. In particular:
> # Is there any resource constraint on the number of remote file handles
> open? (e.g. do they maintain a network connection?)
> # Are there any semantic differences in how remote file handles behave when
> files are deleted, overwritten, or appended?
> # Are there any extra failure cases for remote file handles? (i.e. if a
> machine goes down or a remote file handle is left open for an extended period
> of time)
> The form of caching will depend on the answers, but at the very least, it
> should be possible to cache a remote file handle at the level of a query so
> that a Parquet file with multiple columns can share file handles.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
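
Editorial note: the "about 3MB" figure in the IMPALA-8178 commit message can be checked against the JMX numbers it quotes. The sketch below is only an illustration, not code from Impala. It assumes that all direct buffer capacity reported by the bean is attributable to the 608 open erasure coded handles, and the cache size of 20000 handles used in the projection is a hypothetical value chosen for the example.

```python
import json

# JMX bean copied from the commit message (608 erasure coded handles open).
JMX_BEAN = json.loads("""
{
  "name": "java.nio:type=BufferPool,name=direct",
  "modelerType": "sun.management.ManagementFactoryHelper$1",
  "Name": "direct",
  "TotalCapacity": 1921048960,
  "MemoryUsed": 1921048961,
  "Count": 633,
  "ObjectName": "java.nio:type=BufferPool,name=direct"
}
""")

OPEN_HANDLES = 608  # figure quoted in the commit message
MIB = 1024 * 1024


def direct_memory_per_handle_mib(bean: dict, open_handles: int) -> float:
    """Average direct-buffer capacity per open file handle, in MiB.

    Assumption: every byte in the direct buffer pool belongs to one of the
    open handles, which slightly overstates the per-handle cost.
    """
    return bean["TotalCapacity"] / open_handles / MIB


def projected_direct_memory_gib(per_handle_mib: float, cached_handles: int) -> float:
    """Direct memory a cache of `cached_handles` such handles would pin, in GiB."""
    return per_handle_mib * cached_handles / 1024


per_handle = direct_memory_per_handle_mib(JMX_BEAN, OPEN_HANDLES)
# 20000 is a hypothetical cache size used only for this projection.
projected = projected_direct_memory_gib(per_handle, 20000)

print(f"{per_handle:.2f} MiB of direct memory per handle")
print(f"{projected:.1f} GiB if 20000 such handles stayed cached")
```

Under these assumptions the per-handle cost comes out close to the 3MB stated in the commit message, which is why keeping many such handles open for a long time is a problem when unbuffer() does not release the memory.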
[jira] [Commented] (IMPALA-7265) Cache remote file handles
[ https://issues.apache.org/jira/browse/IMPALA-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763093#comment-16763093 ]

Joe McDonnell commented on IMPALA-7265:
---------------------------------------

[~arodoni_cloudera] Right, this is not an incompatible change.
[jira] [Commented] (IMPALA-7265) Cache remote file handles
[ https://issues.apache.org/jira/browse/IMPALA-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763079#comment-16763079 ]

Alex Rodoni commented on IMPALA-7265:
-------------------------------------

[~joemcdonnell] Since cache_remote_file_handles is a new flag, this new behavior does not have to be categorized as an "incompatible change", right?
[jira] [Commented] (IMPALA-7265) Cache remote file handles
[ https://issues.apache.org/jira/browse/IMPALA-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762367#comment-16762367 ]

ASF subversion and git services commented on IMPALA-7265:
---------------------------------------------------------

Commit 255ec4687ebe6195b20e5566394f3692c07e3b7f in impala's branch refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=255ec46 ]

IMPALA-7265: Enable caching of remote file handles by default

This changes the default value of cache_remote_file_handles from false to
true. Testing shows that this setting has a major impact on performance for
clusters that do remote HDFS reads. Hand testing of the cache did not reveal
any problems with the semantics of caching remote file handles.

Change-Id: I2fc4a69c6bf721017f4adcdc302db9eace5135a4
Reviewed-on: http://gerrit.cloudera.org:8080/12387
Reviewed-by: Philip Zeyliger
Tested-by: Impala Public Jenkins
[jira] [Commented] (IMPALA-7265) Cache remote file handles
[ https://issues.apache.org/jira/browse/IMPALA-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760373#comment-16760373 ]

Joe McDonnell commented on IMPALA-7265:
---------------------------------------

[~arodoni_cloudera] The cache_remote_file_handles parameter is merged with the default value of false. This Jira is open to track setting the default to true.
[jira] [Commented] (IMPALA-7265) Cache remote file handles
[ https://issues.apache.org/jira/browse/IMPALA-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760368#comment-16760368 ]

Alex Rodoni commented on IMPALA-7265:
-------------------------------------

[~joemcdonnell] Could you confirm that this will be in Rel 3.2? The code is merged, but this ticket is still in progress.
[jira] [Commented] (IMPALA-7265) Cache remote file handles
[ https://issues.apache.org/jira/browse/IMPALA-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749340#comment-16749340 ]

Alex Rodoni commented on IMPALA-7265:
-------------------------------------

[~joemcdonnell] Is there a doc impact for this feature?
[jira] [Commented] (IMPALA-7265) Cache remote file handles
[ https://issues.apache.org/jira/browse/IMPALA-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729924#comment-16729924 ]

ASF subversion and git services commented on IMPALA-7265:
---------------------------------------------------------

Commit a3eb5fa90cf721e82a6f5d0aa7edf217be7ef3a1 in impala's branch refs/heads/master from Joe McDonnell
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=a3eb5fa ]

IMPALA-7265: Add parameter to cache remote HDFS file handles

Currently, the file handle cache does not apply to remote HDFS files. This
adds a parameter 'cache_remote_file_handles' that enables the file handle
cache for remote HDFS files. It is currently being tested, so it is set to
false by default.

This does not change the behavior for S3, ADLS, or ABFS.

Change-Id: I549f007432a01ca52fa8093d458a220bba02e1d9
Reviewed-on: http://gerrit.cloudera.org:8080/12111
Tested-by: Impala Public Jenkins
Reviewed-by: Philip Zeyliger
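
Editorial note: the effect of the 'cache_remote_file_handles' parameter described in commit a3eb5fa can be pictured as a small decision function. This is a minimal sketch, not Impala's implementation (which is C++). The `HdfsRead` type and `use_file_handle_cache` function are hypothetical names invented for the example, the sketch assumes the file handle cache is otherwise enabled, and it covers HDFS reads only because the commit states that S3, ADLS, and ABFS behavior is unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HdfsRead:
    """Hypothetical description of one HDFS read; not an Impala type."""
    path: str
    is_local: bool  # True when the data lives on the node doing the read


def use_file_handle_cache(read: HdfsRead, cache_remote_file_handles: bool) -> bool:
    """Decide whether a read may use a cached file handle.

    Local HDFS reads use the cache, as they did before the commit.
    Remote HDFS reads use it only when the new parameter is set.
    """
    if read.is_local:
        return True
    return cache_remote_file_handles


local_read = HdfsRead("/warehouse/t/part-0.parq", is_local=True)
remote_read = HdfsRead("/warehouse/t/part-1.parq", is_local=False)

# With the parameter off, only the local read is eligible for the cache.
print(use_file_handle_cache(local_read, cache_remote_file_handles=False))
print(use_file_handle_cache(remote_read, cache_remote_file_handles=False))
# With the parameter on, the remote read becomes eligible as well.
print(use_file_handle_cache(remote_read, cache_remote_file_handles=True))
```

The parameter is passed explicitly here rather than given a default, since the thread shows the default value changing between commits.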