[jira] [Updated] (HDFS-6698) try to optimize DFSInputStream.getFileLength()
[ https://issues.apache.org/jira/browse/HDFS-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer updated HDFS-6698:
-----------------------------------
    Labels: BB2015-05-TBR  (was: )

> try to optimize DFSInputStream.getFileLength()
> ----------------------------------------------
>
>                 Key: HDFS-6698
>                 URL: https://issues.apache.org/jira/browse/HDFS-6698
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>    Affects Versions: 3.0.0
>            Reporter: Liang Xie
>            Assignee: Liang Xie
>              Labels: BB2015-05-TBR
>         Attachments: HDFS-6698.txt, HDFS-6698.txt, HDFS-6698v2.txt,
>                      HDFS-6698v2.txt, HDFS-6698v3.txt
>
>
> HBase prefers to invoke read() when serving a scan request and pread()
> when serving a get request, because pread() holds almost no lock.
> Let's imagine there's a read() running. Because the definition is:
> {code}
> public synchronized int read
> {code}
> no other read() request can run concurrently. This is known, but pread()
> cannot run either, because:
> {code}
>   public int read(long position, byte[] buffer, int offset, int length)
>     throws IOException {
>     // sanity checks
>     dfsClient.checkOpen();
>     if (closed) {
>       throw new IOException("Stream closed");
>     }
>     failures = 0;
>     long filelen = getFileLength();
> {code}
> getFileLength() also needs the lock, so we need to figure out a lock-free
> implementation of getFileLength() before the HBase multi-stream feature is done.
> [~saint@gmail.com]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
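The contention described in the issue can be reproduced in isolation. The following is a minimal sketch, not the real DFSInputStream: the class `LengthLockSketch`, its fields, and the helper `lockedGetterBlocksWhileReading()` are hypothetical names introduced only for illustration. It assumes the length can be held in a `volatile` field, which is one possible lock-free approach and not necessarily what any attached patch does.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for a stream; not the real DFSInputStream. */
class LengthLockSketch {
    private final long length = 1024L;             // read under the monitor below
    private volatile long publishedLength = 1024L; // readable without the monitor

    /** Simulates a long-running synchronized read() that holds the monitor. */
    synchronized void slowRead(CountDownLatch started, CountDownLatch release) {
        started.countDown();
        try {
            release.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Needs the same monitor as slowRead(), so it waits behind it. */
    synchronized long getFileLengthLocked() {
        return length;
    }

    /** A volatile read; never waits for the monitor. */
    long getFileLengthLockFree() {
        return publishedLength;
    }

    /** True if the locked getter was still blocked after 200 ms while the lock-free one returned. */
    static boolean lockedGetterBlocksWhileReading() {
        LengthLockSketch stream = new LengthLockSketch();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Runnable reader = () -> stream.slowRead(started, release);
            pool.submit(reader);
            started.await();                       // the "read()" now holds the monitor

            long lockFree = stream.getFileLengthLockFree();

            Callable<Long> lockedCall = stream::getFileLengthLocked;
            Future<Long> locked = pool.submit(lockedCall);
            boolean blocked;
            try {
                locked.get(200, TimeUnit.MILLISECONDS);
                blocked = false;
            } catch (TimeoutException expected) {
                blocked = true;                    // still waiting for the monitor
            }
            release.countDown();                   // let the "read()" finish
            return blocked && lockFree == 1024L && locked.get() == 1024L;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }
}
```

Calling `LengthLockSketch.lockedGetterBlocksWhileReading()` is expected to return `true`: the synchronized getter stays stuck for as long as the simulated read() runs, while the volatile getter returns immediately.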
[jira] [Updated] (HDFS-6698) try to optimize DFSInputStream.getFileLength()
Lars Hofhansl updated HDFS-6698:
--------------------------------
    Attachment: HDFS-6698v3.txt

I just ran into this as well while debugging why HBase does not benefit from Snappy compression as much as it should.
It turns out a non-trivial amount of time (as determined by a sampler, not an instrumenting profiler) is spent in this method.

To be safe, I'd probably also turn LocatedBlocks into an immutable object (well, except for blocks) - see attached patch. All members of LocatedBlocks are safely published now.
With that, I don't think this patch can do any harm.
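The idea of an immutable, safely published object can be sketched as follows. This is an illustration under stated assumptions, not the contents of HDFS-6698v3.txt: `BlockSnapshot` and `SnapshotStreamSketch` are hypothetical classes, and the two length fields are placeholders for whatever state the real LocatedBlocks carries.

```java
/** Hypothetical immutable snapshot; every field is final. */
final class BlockSnapshot {
    private final long locatedLength;
    private final long lastBlockBeingWrittenLength;

    BlockSnapshot(long locatedLength, long lastBlockBeingWrittenLength) {
        this.locatedLength = locatedLength;
        this.lastBlockBeingWrittenLength = lastBlockBeingWrittenLength;
    }

    long fileLength() {
        return locatedLength + lastBlockBeingWrittenLength;
    }
}

/** Hypothetical stream that swaps whole snapshots instead of mutating fields. */
class SnapshotStreamSketch {
    // Volatile reference: a reader sees either the old or the new snapshot,
    // never a half-updated mix of the two lengths.
    private volatile BlockSnapshot snapshot; // null until the first refresh

    void refresh(long locatedLength, long lastBlockBeingWrittenLength) {
        snapshot = new BlockSnapshot(locatedLength, lastBlockBeingWrittenLength);
    }

    /** Lock-free: one volatile read, then only final fields are touched. */
    long getFileLength() {
        BlockSnapshot current = snapshot;
        return current == null ? 0L : current.fileLength();
    }
}
```

Because both lengths live in one immutable object, a reader never combines a new located length with a stale last-block length, which is the kind of hazard a lock would otherwise guard against.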
[jira] [Updated] (HDFS-6698) try to optimize DFSInputStream.getFileLength()
stack updated HDFS-6698:
------------------------
    Attachment: HDFS-6698v2.txt

Rebase. Retry.
[jira] [Updated] (HDFS-6698) try to optimize DFSInputStream.getFileLength()
stack updated HDFS-6698:
------------------------
    Attachment: HDFS-6698v2.txt

v2 adds protection against the scenario Colin suggests (though it can't happen with the code as is).

This patch is conservative. It does not change semantics. It just speeds up getting the file length by keeping a cached copy, which it returns unless anything has changed since we last computed the file length.

Discussion of locking and concurrency in DFSIS in general is going on in other issues, at levels above where this patch is working.
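The cached-copy approach described in this comment might look roughly like the sketch below. It is an assumption-laden illustration, not the v2 patch itself: `CachedLengthSketch`, `updateBlocks`, the `UNKNOWN` sentinel, and both length fields are hypothetical names chosen for the example.

```java
/** Hypothetical sketch of a cached file length; not the real DFSInputStream. */
class CachedLengthSketch {
    private static final long UNKNOWN = -1L;

    private long locatedLength;               // guarded by "this"
    private long lastBlockBeingWrittenLength; // guarded by "this"

    // Volatile so the fast path in getFileLength() can read it without the monitor.
    private volatile long cachedLength = UNKNOWN;

    /** Any change to the underlying state invalidates the cached copy. */
    synchronized void updateBlocks(long locatedLength, long lastBlockBeingWrittenLength) {
        this.locatedLength = locatedLength;
        this.lastBlockBeingWrittenLength = lastBlockBeingWrittenLength;
        cachedLength = UNKNOWN;
    }

    long getFileLength() {
        long cached = cachedLength;           // fast path: a single volatile read
        if (cached != UNKNOWN) {
            return cached;
        }
        synchronized (this) {                 // slow path: recompute under the lock
            long fresh = locatedLength + lastBlockBeingWrittenLength;
            cachedLength = fresh;
            return fresh;
        }
    }
}
```

Both the invalidation and the recomputation happen under the same monitor, so a stale value cannot be written back after an update. The lock is taken only on the first call after a change; later calls return the cached value without waiting behind a running read().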
[jira] [Updated] (HDFS-6698) try to optimize DFSInputStream.getFileLength()
stack updated HDFS-6698:
------------------------
    Attachment: HDFS-6698.txt

Retry
[jira] [Updated] (HDFS-6698) try to optimize DFSInputStream.getFileLength()
Liang Xie updated HDFS-6698:
----------------------------
    Issue Type: Sub-task  (was: Improvement)
        Parent: HDFS-6735
[jira] [Updated] (HDFS-6698) try to optimize DFSInputStream.getFileLength()
Liang Xie updated HDFS-6698:
----------------------------
    Attachment: HDFS-6698.txt
[jira] [Updated] (HDFS-6698) try to optimize DFSInputStream.getFileLength()
Liang Xie updated HDFS-6698:
----------------------------
    Status: Patch Available  (was: Open)