[jira] [Commented] (IMPALA-7265) Cache remote file handles

2019-02-22 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775404#comment-16775404
 ] 

ASF subversion and git services commented on IMPALA-7265:
-

Commit dce82e4e018d1944ff19bb6f87139b51c1b0287e in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=dce82e4 ]

IMPALA-8178: Disable file handle cache for HDFS erasure coded files

Testing on an erasure coded minicluster has revealed that each
file handle for an erasure coded file uses about 3MB of native
memory. This shows up as "java.nio:type=BufferPool,name=direct"
in the /jmx endpoint (here showing the output when 608 handles
are open):

{
  "name": "java.nio:type=BufferPool,name=direct",
  "modelerType": "sun.management.ManagementFactoryHelper$1",
  "Name": "direct",
  "TotalCapacity": 1921048960,
  "MemoryUsed": 1921048961,
  "Count": 633,
  "ObjectName": "java.nio:type=BufferPool,name=direct"
}
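
As a rough check of the "about 3MB" figure, the capacity reported above can be divided by the number of open handles. This is only a sketch: it assumes all direct buffer capacity is attributable to the 608 open file handles, which the output does not strictly establish (the pool reports 633 buffers). The variable names are illustrative and not taken from Impala.

```python
import json

# Values copied from the /jmx output quoted above (other fields omitted).
jmx_bean = json.loads("""
{
  "name": "java.nio:type=BufferPool,name=direct",
  "TotalCapacity": 1921048960,
  "Count": 633
}
""")

# Number of open file handles stated in the commit message.
open_handles = 608

# Assumption: all direct buffer capacity belongs to the open file handles.
bytes_per_handle = jmx_bean["TotalCapacity"] / open_handles
mib_per_handle = bytes_per_handle / (1024 * 1024)

print(f"{bytes_per_handle:.0f} bytes (~{mib_per_handle:.2f} MiB) per handle")
```

Under that assumption the result is slightly above 3 MiB per handle, consistent with the estimate in the commit message.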

The memory is not released or reduced by a call to unbuffer(),
so these file handles are not suitable for long term caching.
HDFS-14308 tracks the implementation of unbuffer() for
DFSStripedInputStream. This issue showed up when remote
file handle caching was enabled in IMPALA-7265, as erasure
coded files are always scheduled to be remote (IMPALA-7019).

This disables file handle caching for erasure coded files,
which requires plumbing through the information about which
ScanRanges are accessing erasure coded files.
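
The resulting caching decision might be sketched as follows. This is a hypothetical illustration, not Impala's actual code: the ScanRangeInfo type, its fields, and the should_cache_file_handle function are invented names, and the sketch assumes the file handle cache is otherwise enabled for local HDFS files. Only the cache_remote_file_handles flag name comes from this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanRangeInfo:
    """Hypothetical per-scan-range facts plumbed through to the I/O layer."""
    is_remote: bool
    is_erasure_coded: bool


def should_cache_file_handle(scan_range: ScanRangeInfo,
                             cache_remote_file_handles: bool = True) -> bool:
    """Decide whether a file handle is eligible for the file handle cache."""
    # Erasure coded handles hold native memory that unbuffer() does not
    # release, so they are never cached, regardless of the remote flag.
    if scan_range.is_erasure_coded:
        return False
    # Remote handles are cached only when the flag is enabled.
    if scan_range.is_remote:
        return cache_remote_file_handles
    # Assumption: local HDFS handles are always eligible.
    return True


# Erasure coded files are always scheduled as remote (IMPALA-7019).
ec_range = ScanRangeInfo(is_remote=True, is_erasure_coded=True)
remote_range = ScanRangeInfo(is_remote=True, is_erasure_coded=False)
local_range = ScanRangeInfo(is_remote=False, is_erasure_coded=False)
```

Checking the erasure coded case first means the exclusion holds even when cache_remote_file_handles is true.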

With this change, core tests pass on an erasure coded system.

Change-Id: I8c761e08aacc952de0033a4c91e07f15c8ec96da
Reviewed-on: http://gerrit.cloudera.org:8080/12552
Reviewed-by: Joe McDonnell 
Tested-by: Impala Public Jenkins 


> Cache remote file handles
> -
>
> Key: IMPALA-7265
> URL: https://issues.apache.org/jira/browse/IMPALA-7265
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 3.1.0
> Reporter: Joe McDonnell
> Assignee: Joe McDonnell
> Priority: Critical
> Fix For: Impala 3.2.0
>
>
> The file handle cache currently does not allow caching remote file handles. 
> This means that clusters that have a lot of remote reads can suffer from 
> overloading the NameNode. Impala should be able to cache remote file handles.
> There are some open questions about remote file handles and whether they 
> behave differently from local file handles. In particular:
>  1. Is there any resource constraint on the number of remote file handles
> open? (e.g. do they maintain a network connection?)
>  2. Are there any semantic differences in how remote file handles behave when
> files are deleted, overwritten, or appended?
>  3. Are there any extra failure cases for remote file handles? (i.e. if a
> machine goes down or a remote file handle is left open for an extended period
> of time)
> The form of caching will depend on the answers, but at the very least, it 
> should be possible to cache a remote file handle at the level of a query so 
> that a Parquet file with multiple columns can share file handles.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7265) Cache remote file handles

2019-02-07 Thread Joe McDonnell (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763093#comment-16763093
 ] 

Joe McDonnell commented on IMPALA-7265:
---

[~arodoni_cloudera] Right, this is not an incompatible change.




[jira] [Commented] (IMPALA-7265) Cache remote file handles

2019-02-07 Thread Alex Rodoni (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763079#comment-16763079
 ] 

Alex Rodoni commented on IMPALA-7265:
-

[~joemcdonnell] Since cache_remote_file_handles is a new flag, this new
behavior does not have to be categorized as an "incompatible change", right?




[jira] [Commented] (IMPALA-7265) Cache remote file handles

2019-02-06 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762367#comment-16762367
 ] 

ASF subversion and git services commented on IMPALA-7265:
-

Commit 255ec4687ebe6195b20e5566394f3692c07e3b7f in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=255ec46 ]

IMPALA-7265: Enable caching of remote file handles by default

This changes the default value of cache_remote_file_handles
from false to true. Testing shows that this setting has a
major impact on performance for clusters that do remote HDFS
reads. Hand testing of the cache did not reveal any problems
with the semantics of caching remote file handles.

Change-Id: I2fc4a69c6bf721017f4adcdc302db9eace5135a4
Reviewed-on: http://gerrit.cloudera.org:8080/12387
Reviewed-by: Philip Zeyliger 
Tested-by: Impala Public Jenkins 





[jira] [Commented] (IMPALA-7265) Cache remote file handles

2019-02-04 Thread Joe McDonnell (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760373#comment-16760373
 ] 

Joe McDonnell commented on IMPALA-7265:
---

[~arodoni_cloudera] The cache_remote_file_handles parameter is merged with the 
default value of false. This Jira is open to track setting the default to true.




[jira] [Commented] (IMPALA-7265) Cache remote file handles

2019-02-04 Thread Alex Rodoni (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760368#comment-16760368
 ] 

Alex Rodoni commented on IMPALA-7265:
-

[~joemcdonnell] Could you confirm that this will be in Rel 3.2? The code is 
merged, but this ticket is still in progress. 




[jira] [Commented] (IMPALA-7265) Cache remote file handles

2019-01-22 Thread Alex Rodoni (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749340#comment-16749340
 ] 

Alex Rodoni commented on IMPALA-7265:
-

[~joemcdonnell] Is there a doc impact for this feature?




[jira] [Commented] (IMPALA-7265) Cache remote file handles

2018-12-27 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729924#comment-16729924
 ] 

ASF subversion and git services commented on IMPALA-7265:
-

Commit a3eb5fa90cf721e82a6f5d0aa7edf217be7ef3a1 in impala's branch 
refs/heads/master from Joe McDonnell
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=a3eb5fa ]

IMPALA-7265: Add parameter to cache remote HDFS file handles

Currently, the file handle cache does not apply to remote HDFS
files. This adds a parameter 'cache_remote_file_handles' that
enables the file handle cache for remote HDFS files. It is
currently being tested, so it is set to false by default.
This does not change the behavior for S3, ADLS, or ABFS.

Change-Id: I549f007432a01ca52fa8093d458a220bba02e1d9
Reviewed-on: http://gerrit.cloudera.org:8080/12111
Tested-by: Impala Public Jenkins 
Reviewed-by: Philip Zeyliger 

