[jira] [Updated] (KUDU-2052) Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs filesystems
[ https://issues.apache.org/jira/browse/KUDU-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adar Dembo updated KUDU-2052:
-----------------------------
    Code Review: http://gerrit.cloudera.org:8080/7269
         Labels: data-scalability  (was: )

> Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs filesystems
> --------------------------------------------------------------
>
>                 Key: KUDU-2052
>                 URL: https://issues.apache.org/jira/browse/KUDU-2052
>             Project: Kudu
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 1.4.0
>            Reporter: Adar Dembo
>            Assignee: Adar Dembo
>            Priority: Critical
>              Labels: data-scalability
>

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (KUDU-2052) Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs filesystems
[ https://issues.apache.org/jira/browse/KUDU-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060134#comment-16060134 ]

Jean-Daniel Cryans commented on KUDU-2052:
------------------------------------------

What should we recommend to folks on xfs and el6 who are upgrading to 1.4?

> Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs filesystems
> --------------------------------------------------------------
>
>                 Key: KUDU-2052
>                 URL: https://issues.apache.org/jira/browse/KUDU-2052
>             Project: Kudu
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 1.4.0
>            Reporter: Adar Dembo
>            Assignee: Adar Dembo
>            Priority: Critical
>

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (KUDU-2052) Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs filesystems
[ https://issues.apache.org/jira/browse/KUDU-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060114#comment-16060114 ]

Adar Dembo commented on KUDU-2052:
----------------------------------

The following experiments were run using a single spinning HDD on a CentOS 6.6
machine. First, a filesystem was created in a 10G file and mounted with -o
loop. Inside the filesystem a 1G test file was created using dd from
/dev/urandom. The first 100M of the test file were punched out via fallocate,
sync was run, and then the following repunching tests were run in a loop using
perf stat.

ext4 with fallocate-based hole punching:
{noformat}
 Performance counter stats for 'fallocate -p -o 0 -l 100M foo' (1000 runs):

          0.269927 task-clock                #    0.635 CPUs utilized            ( +-  0.33% )
                 0 context-switches          #    0.000 K/sec
                 0 cpu-migrations            #    0.041 K/sec                    ( +- 30.00% )
               150 page-faults               #    0.555 M/sec                    ( +-  0.01% )
           770,390 cycles                    #    2.854 GHz                      ( +-  0.33% )
                   stalled-cycles-frontend
                   stalled-cycles-backend
           477,294 instructions              #    0.62  insns per cycle          ( +-  0.04% )
            98,535 branches                  #  365.044 M/sec                    ( +-  0.03% )
             3,979 branch-misses             #    4.04% of all branches          ( +-  0.61% )

       0.000425207 seconds time elapsed                                          ( +-  0.45% )
{noformat}

xfs filesystem with fallocate-based hole punching:
{noformat}
 Performance counter stats for 'fallocate -p -o 0 -l 100M foo' (1000 runs):

          0.403296 task-clock                #    0.013 CPUs utilized            ( +-  0.32% )
                 2 context-switches          #    0.005 M/sec                    ( +-  0.17% )
                 0 cpu-migrations            #    0.017 K/sec                    ( +- 37.68% )
               150 page-faults               #    0.371 M/sec                    ( +-  0.01% )
         1,112,706 cycles                    #    2.759 GHz                      ( +-  0.17% )
                   stalled-cycles-frontend
                   stalled-cycles-backend
           505,027 instructions              #    0.45  insns per cycle          ( +-  0.03% )
           103,652 branches                  #  257.013 M/sec                    ( +-  0.02% )
             5,750 branch-misses             #    5.55% of all branches          ( +-  0.04% )

       0.031220273 seconds time elapsed                                          ( +-  0.57% )
{noformat}

xfs filesystem with XFS_IOC_UNRESVSP64-based hole punching:
{noformat}
 Performance counter stats for 'xfs_io -c unresvsp 0 104857600 foo' (1000 runs):

          0.477930 task-clock                #    0.677 CPUs utilized            ( +-  0.28% )
                 0 context-switches          #    0.004 K/sec                    ( +- 70.68% )
                 0 cpu-migrations            #    0.017 K/sec                    ( +- 39.42% )
               215 page-faults               #    0.449 M/sec                    ( +-  0.01% )
         1,463,629 cycles                    #    3.062 GHz                      ( +-  0.15% )
                   stalled-cycles-frontend
                   stalled-cycles-backend
         1,150,346 instructions              #    0.79  insns per cycle          ( +-  0.01% )
           241,338 branches                  #  504.964 M/sec                    ( +-  0.01% )
             9,753 branch-misses             #    4.04% of all branches          ( +-  0.02% )

       0.000706070 seconds time elapsed                                          ( +-  0.36% )
{noformat}

> Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs filesystems
> --------------------------------------------------------------
>
>                 Key: KUDU-2052
>                 URL: https://issues.apache.org/jira/browse/KUDU-2052
>             Project: Kudu
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 1.4.0
>            Reporter: Adar Dembo
>            Assignee: Adar Dembo
>            Priority: Critical
>

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Created] (KUDU-2052) Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs filesystems
Adar Dembo created KUDU-2052:
--------------------------------

             Summary: Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs filesystems
                 Key: KUDU-2052
                 URL: https://issues.apache.org/jira/browse/KUDU-2052
             Project: Kudu
          Issue Type: Bug
          Components: util
    Affects Versions: 1.4.0
            Reporter: Adar Dembo
            Assignee: Adar Dembo
            Priority: Critical


One of the changes in Kudu 1.4 is more comprehensive repair functionality in
log block manager startup. Among other things, this includes a heuristic to
detect whether an LBM container consumes more disk space than it should, based
on the live blocks in the container. If the heuristic fires, the LBM reclaims
the extra disk space by truncating the end of the container and repunching all
of the dead blocks in the container.

We brought up Kudu 1.4 on a large production cluster running xfs and observed
pathologically slow startup times. On one node, there was a three-hour gap
between the last bit of data directory processing and the end of LBM startup
in general. This time can only be attributed to hole repunching, which is
executed by the same set of thread pools that open the data directories.

Further research revealed that on xfs in el6, a hole punch via fallocate()
_always_ includes an fsync() (in the kernel), even if the underlying data was
already punched out. This isn't the case with ext4, nor does it appear to be
the case with xfs in more modern kernels (though this hasn't been confirmed).

xfs provides the [XFS_IOC_UNRESVSP64 ioctl|https://linux.die.net/man/3/xfsctl],
which can be used to deallocate space from a file. That sounds an awful lot
like hole punching, and some quick performance tests show that it doesn't
incur the cost of an fsync(). We should switch over to it when punching holes
on xfs: certainly on older (i.e. el6) kernels, and potentially everywhere for
simplicity's sake.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Updated] (KUDU-2051) [DOCS] Update Impala integration limitations
[ https://issues.apache.org/jira/browse/KUDU-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ambreen Kazi updated KUDU-2051:
-------------------------------
    Description: 
Remove timestamp and predicate evaluation for NULL/NOT NULL from the list of
Impala/Kudu limitations (IMPALA-4859).

https://kudu.apache.org/docs/kudu_impala_integration.html#_known_issues_and_limitations

cc: [~tlipcon] - does that sound right?

  was:
Remove timestamp and predicate evaluation for NULL/NOT NULL from the list of
Impala/Kudu limitations.

https://kudu.apache.org/docs/kudu_impala_integration.html#_known_issues_and_limitations

cc: [~tlipcon] - does that sound right?


> [DOCS] Update Impala integration limitations
> --------------------------------------------
>
>                 Key: KUDU-2051
>                 URL: https://issues.apache.org/jira/browse/KUDU-2051
>             Project: Kudu
>          Issue Type: Task
>          Components: documentation
>    Affects Versions: 1.4.0
>            Reporter: Ambreen Kazi
>            Assignee: Ambreen Kazi
>

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Created] (KUDU-2051) [DOCS] Update Impala integration limitations
Ambreen Kazi created KUDU-2051:
----------------------------------

             Summary: [DOCS] Update Impala integration limitations
                 Key: KUDU-2051
                 URL: https://issues.apache.org/jira/browse/KUDU-2051
             Project: Kudu
          Issue Type: Task
          Components: documentation
    Affects Versions: 1.4.0
            Reporter: Ambreen Kazi
            Assignee: Ambreen Kazi


Remove timestamp and predicate evaluation for NULL/NOT NULL from the list of
Impala/Kudu limitations.

https://kudu.apache.org/docs/kudu_impala_integration.html#_known_issues_and_limitations

cc: [~tlipcon] - does that sound right?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (KUDU-2050) Avoid peer eviction during block manager startup
[ https://issues.apache.org/jira/browse/KUDU-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16059889#comment-16059889 ]

Dan Burkert commented on KUDU-2050:
-----------------------------------

I don't think it's a good idea to treat a tablet that is de facto offline as
bootstrapping. That has the potential to greatly increase the window in which
a tablet is under-replicated. In the long term I think it's preferable to
over-replicate to 4 replicas, then kill one off when the node finally comes
back.

> Avoid peer eviction during block manager startup
> ------------------------------------------------
>
>                 Key: KUDU-2050
>                 URL: https://issues.apache.org/jira/browse/KUDU-2050
>             Project: Kudu
>          Issue Type: Bug
>          Components: fs, tserver
>    Affects Versions: 1.4.0
>            Reporter: Adar Dembo
>            Priority: Critical
>

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Created] (KUDU-2050) Avoid peer eviction during block manager startup
Adar Dembo created KUDU-2050:
--------------------------------

             Summary: Avoid peer eviction during block manager startup
                 Key: KUDU-2050
                 URL: https://issues.apache.org/jira/browse/KUDU-2050
             Project: Kudu
          Issue Type: Bug
          Components: fs, tserver
    Affects Versions: 1.4.0
            Reporter: Adar Dembo
            Priority: Critical


In larger deployments we've observed that opening the block manager can take a
really long time: tens of minutes, or sometimes even hours. This is especially
true as of 1.4, where the log block manager tries to optimize on-disk data
structures during startup.

The default time to Raft peer eviction is 5 minutes. If one node is restarted
and LBM startup takes over 5 minutes, or if all nodes are restarted and
there's over 5 minutes of LBM startup time variance across them, the "slow"
node could have all of its replicas evicted. Besides generating a lot of
unnecessary rereplication work, this effectively "defeats" the LBM
optimizations in that it would have been equally slow (but more efficient) to
reformat the node instead.

So, let's reorder startup such that LBM startup counts towards replica
bootstrapping. One idea: adjust FsManager startup so that tablet-meta/cmeta
files can be accessed early to construct bootstrapping replicas, but defer
opening of the block manager until after that time.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Assigned] (KUDU-2049) Too-strict check on RLE-encoded integer columns causes crash on scan
[ https://issues.apache.org/jira/browse/KUDU-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Will Berkeley reassigned KUDU-2049:
-----------------------------------

    Assignee: Will Berkeley

> Too-strict check on RLE-encoded integer columns causes crash on scan
> --------------------------------------------------------------------
>
>                 Key: KUDU-2049
>                 URL: https://issues.apache.org/jira/browse/KUDU-2049
>             Project: Kudu
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>            Reporter: Will Berkeley
>            Assignee: Will Berkeley
>

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Created] (KUDU-2049) Too-strict check on RLE-encoded integer columns causes crash on scan
Will Berkeley created KUDU-2049:
-----------------------------------

             Summary: Too-strict check on RLE-encoded integer columns causes crash on scan
                 Key: KUDU-2049
                 URL: https://issues.apache.org/jira/browse/KUDU-2049
             Project: Kudu
          Issue Type: Bug
    Affects Versions: 1.4.0
            Reporter: Will Berkeley


Sometimes scans of RLE-encoded integer columns cause CHECK failures due to the
CHECK condition being too strict:

{{Check failed: pos < num_elems_ (128 vs. 128) Tried to seek to 128 which is >= number of elements (128) in the block!}}

It's valid to scan just past the number of elements in the block, though.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)