[jira] [Updated] (KUDU-2052) Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs filesystems

2017-06-22 Thread Adar Dembo (JIRA)

[ https://issues.apache.org/jira/browse/KUDU-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adar Dembo updated KUDU-2052:
-----------------------------
Code Review: http://gerrit.cloudera.org:8080/7269
 Labels: data-scalability  (was: )

> Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs filesystems
> --------------------------------------------------------------
>
> Key: KUDU-2052
> URL: https://issues.apache.org/jira/browse/KUDU-2052
> Project: Kudu
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Assignee: Adar Dembo
>Priority: Critical
>  Labels: data-scalability
>
> One of the changes in Kudu 1.4 is more comprehensive repair functionality 
> in log block manager startup. Amongst other things, this includes a heuristic 
> to detect whether an LBM container consumes more disk space than it should, 
> based on the live blocks in the container. If the heuristic fires, the LBM 
> reclaims the extra disk space by truncating the end of the container and 
> repunching all of the dead blocks in the container.
> We brought up Kudu 1.4 on a large production cluster running xfs and observed 
> pathologically slow startup times. On one node, there was a three-hour gap 
> between the last bit of data directory processing and the end of LBM startup 
> overall. This time can only be attributed to hole repunching, which is 
> executed by the same set of thread pools that open the data directories.
> Further research revealed that on xfs in el6, a hole punch via fallocate() 
> _always_ includes an fsync() (in the kernel), even if the underlying data was 
> already punched out. This isn't the case with ext4, nor does it appear to be 
> the case with xfs in more modern kernels (though this hasn't been confirmed).
> xfs provides the [XFS_IOC_UNRESVSP64 
> ioctl|https://linux.die.net/man/3/xfsctl], which can be used to deallocate 
> space from a file. That sounds an awful lot like hole punching, and some 
> quick performance tests show that it doesn't incur the cost of an fsync(). We 
> should switch over to it when punching holes on xfs: certainly on older (i.e. 
> el6) kernels, and potentially everywhere for simplicity's sake.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2052) Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs filesystems

2017-06-22 Thread Jean-Daniel Cryans (JIRA)

[ https://issues.apache.org/jira/browse/KUDU-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060134#comment-16060134 ]

Jean-Daniel Cryans commented on KUDU-2052:
------------------------------------------

What should we recommend to folks on xfs and el6 who are upgrading to 1.4?

> Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs filesystems
> --------------------------------------------------------------
>
> Key: KUDU-2052
> URL: https://issues.apache.org/jira/browse/KUDU-2052
> Project: Kudu
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Assignee: Adar Dembo
>Priority: Critical
>
> One of the changes in Kudu 1.4 is more comprehensive repair functionality 
> in log block manager startup. Amongst other things, this includes a heuristic 
> to detect whether an LBM container consumes more disk space than it should, 
> based on the live blocks in the container. If the heuristic fires, the LBM 
> reclaims the extra disk space by truncating the end of the container and 
> repunching all of the dead blocks in the container.
> We brought up Kudu 1.4 on a large production cluster running xfs and observed 
> pathologically slow startup times. On one node, there was a three-hour gap 
> between the last bit of data directory processing and the end of LBM startup 
> overall. This time can only be attributed to hole repunching, which is 
> executed by the same set of thread pools that open the data directories.
> Further research revealed that on xfs in el6, a hole punch via fallocate() 
> _always_ includes an fsync() (in the kernel), even if the underlying data was 
> already punched out. This isn't the case with ext4, nor does it appear to be 
> the case with xfs in more modern kernels (though this hasn't been confirmed).
> xfs provides the [XFS_IOC_UNRESVSP64 
> ioctl|https://linux.die.net/man/3/xfsctl], which can be used to deallocate 
> space from a file. That sounds an awful lot like hole punching, and some 
> quick performance tests show that it doesn't incur the cost of an fsync(). We 
> should switch over to it when punching holes on xfs: certainly on older (i.e. 
> el6) kernels, and potentially everywhere for simplicity's sake.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2052) Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs filesystems

2017-06-22 Thread Adar Dembo (JIRA)

[ https://issues.apache.org/jira/browse/KUDU-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060114#comment-16060114 ]

Adar Dembo commented on KUDU-2052:
----------------------------------

The following experiments were run using a single spinning HDD on a CentOS 6.6 
machine. First, a filesystem was created in a 10G file and mounted with -o 
loop. Inside the filesystem, a 1G test file was created using dd from 
/dev/urandom. The first 100M of the test file were punched out via fallocate, 
sync was run, and then the following repunching tests were run in a loop using 
perf stat.
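
For reference, here is a sketch of that setup (paths are illustrative, and the 
exact mkfs options are my assumption, not recorded in the experiment):
{noformat}
# Build a filesystem in a 10G file and loop-mount it.
truncate -s 10G /data/fs.img
mkfs.xfs /data/fs.img                          # or mkfs.ext4 for the ext4 run
mount -o loop /data/fs.img /mnt/test

# Create a 1G test file, punch out its first 100M once, and sync, so the
# benchmarked runs below re-punch an already-punched hole.
dd if=/dev/urandom of=/mnt/test/foo bs=1M count=1024
fallocate -p -o 0 -l 100M /mnt/test/foo
sync

# Repunching benchmarks (1000 runs each), from inside the mount:
cd /mnt/test
perf stat -r 1000 fallocate -p -o 0 -l 100M foo
perf stat -r 1000 xfs_io -c 'unresvsp 0 104857600' foo
{noformat}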

ext4 with fallocate-based hole punching:
{noformat}
 Performance counter stats for 'fallocate -p -o 0 -l 100M foo' (1000 runs):

       0.269927 task-clock                #    0.635 CPUs utilized            ( +-  0.33% )
              0 context-switches          #    0.000 K/sec
              0 cpu-migrations            #    0.041 K/sec                    ( +- 30.00% )
            150 page-faults               #    0.555 M/sec                    ( +-  0.01% )
        770,390 cycles                    #    2.854 GHz                      ( +-  0.33% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
        477,294 instructions              #    0.62  insns per cycle          ( +-  0.04% )
         98,535 branches                  #  365.044 M/sec                    ( +-  0.03% )
          3,979 branch-misses             #    4.04% of all branches          ( +-  0.61% )

    0.000425207 seconds time elapsed                                          ( +-  0.45% )
{noformat}

xfs filesystem with fallocate-based hole punching:
{noformat}
  Performance counter stats for 'fallocate -p -o 0 -l 100M foo' (1000 runs):

       0.403296 task-clock                #    0.013 CPUs utilized            ( +-  0.32% )
              2 context-switches          #    0.005 M/sec                    ( +-  0.17% )
              0 cpu-migrations            #    0.017 K/sec                    ( +- 37.68% )
            150 page-faults               #    0.371 M/sec                    ( +-  0.01% )
      1,112,706 cycles                    #    2.759 GHz                      ( +-  0.17% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
        505,027 instructions              #    0.45  insns per cycle          ( +-  0.03% )
        103,652 branches                  #  257.013 M/sec                    ( +-  0.02% )
          5,750 branch-misses             #    5.55% of all branches          ( +-  0.04% )

    0.031220273 seconds time elapsed                                          ( +-  0.57% )
{noformat}

xfs filesystem with XFS_IOC_UNRESVSP64-based hole punching:
{noformat}
 Performance counter stats for 'xfs_io -c unresvsp 0 104857600 foo' (1000 runs):

       0.477930 task-clock                #    0.677 CPUs utilized            ( +-  0.28% )
              0 context-switches          #    0.004 K/sec                    ( +- 70.68% )
              0 cpu-migrations            #    0.017 K/sec                    ( +- 39.42% )
            215 page-faults               #    0.449 M/sec                    ( +-  0.01% )
      1,463,629 cycles                    #    3.062 GHz                      ( +-  0.15% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
      1,150,346 instructions              #    0.79  insns per cycle          ( +-  0.01% )
        241,338 branches                  #  504.964 M/sec                    ( +-  0.01% )
          9,753 branch-misses             #    4.04% of all branches          ( +-  0.02% )

    0.000706070 seconds time elapsed                                          ( +-  0.36% )
{noformat}

> Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs filesystems
> --------------------------------------------------------------
>
> Key: KUDU-2052
> URL: https://issues.apache.org/jira/browse/KUDU-2052
> Project: Kudu
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Assignee: Adar Dembo
>Priority: Critical
>
> One of the changes in Kudu 1.4 is more comprehensive repair functionality 
> in log block manager startup. Amongst other things, this includes a heuristic 
> to detect whether an LBM container consumes more disk space than it should, 
> based on the live blocks in the container. If the heuristic fires, the LBM 
> reclaims the extra disk space by truncating the end of the container and 
> repunching all of the dead blocks in the container.
> We brought up Kudu 1.4 on a large production cluster running xfs and observed 
> pathologically slow startup times. On one node, there was a three-hour gap 
> between the last bit of data directory processing and the end of LBM startup 
> overall. This time can only be attributed to hole repunching, which is 
> executed by the same set of thread pools that open the data directories.

[jira] [Created] (KUDU-2052) Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs filesystems

2017-06-22 Thread Adar Dembo (JIRA)
Adar Dembo created KUDU-2052:


 Summary: Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs 
filesystems
 Key: KUDU-2052
 URL: https://issues.apache.org/jira/browse/KUDU-2052
 Project: Kudu
  Issue Type: Bug
  Components: util
Affects Versions: 1.4.0
Reporter: Adar Dembo
Assignee: Adar Dembo
Priority: Critical


One of the changes in Kudu 1.4 is more comprehensive repair functionality in 
log block manager startup. Amongst other things, this includes a heuristic to 
detect whether an LBM container consumes more disk space than it should, based 
on the live blocks in the container. If the heuristic fires, the LBM reclaims 
the extra disk space by truncating the end of the container and repunching 
all of the dead blocks in the container.

We brought up Kudu 1.4 on a large production cluster running xfs and observed 
pathologically slow startup times. On one node, there was a three-hour gap 
between the last bit of data directory processing and the end of LBM startup 
overall. This time can only be attributed to hole repunching, which is executed 
by the same set of thread pools that open the data directories.

Further research revealed that on xfs in el6, a hole punch via fallocate() 
_always_ includes an fsync() (in the kernel), even if the underlying data was 
already punched out. This isn't the case with ext4, nor does it appear to be 
the case with xfs in more modern kernels (though this hasn't been confirmed).

xfs provides the [XFS_IOC_UNRESVSP64 ioctl|https://linux.die.net/man/3/xfsctl], 
which can be used to deallocate space from a file. That sounds an awful lot 
like hole punching, and some quick performance tests show that it doesn't incur 
the cost of an fsync(). We should switch over to it when punching holes on xfs: 
certainly on older (i.e. el6) kernels, and potentially everywhere for 
simplicity's sake.
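
A minimal sketch of the xfs-specific path (illustrative only, not the actual 
patch; it assumes the xfsprogs development headers are installed):
{noformat}
// Punch a hole on xfs via XFS_IOC_UNRESVSP64 rather than
// fallocate(FALLOC_FL_PUNCH_HOLE). Error handling is abbreviated.
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <xfs/xfs.h>  // XFS_IOC_UNRESVSP64, xfs_flock64_t

#include <cstdio>

int PunchHoleXfs(int fd, off_t offset, off_t length) {
  xfs_flock64_t fl = {};
  fl.l_whence = SEEK_SET;  // l_start is an absolute file offset
  fl.l_start = offset;
  fl.l_len = length;
  // Deallocates the blocks backing [offset, offset + length); unlike
  // fallocate() on el6 xfs, this does not force an in-kernel fsync().
  if (ioctl(fd, XFS_IOC_UNRESVSP64, &fl) < 0) {
    perror("ioctl(XFS_IOC_UNRESVSP64)");
    return -1;
  }
  return 0;
}
{noformat}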




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-2051) [DOCS] Update Impala integration limitations

2017-06-22 Thread Ambreen Kazi (JIRA)

[ https://issues.apache.org/jira/browse/KUDU-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ambreen Kazi updated KUDU-2051:
-------------------------------
Description: 
Remove timestamp and predicate evaluation for NULL/NOT NULL from list of 
Impala/Kudu limitations (IMPALA-4859).

https://kudu.apache.org/docs/kudu_impala_integration.html#_known_issues_and_limitations

cc: [~tlipcon] - does that sound right? 

  was:
Remove timestamp and predicate evaluation for NULL/NOT NULL from list of 
Impala/Kudu limitations.

https://kudu.apache.org/docs/kudu_impala_integration.html#_known_issues_and_limitations

cc: [~tlipcon] - does that sound right? 


> [DOCS] Update Impala integration limitations 
> ---------------------------------------------
>
> Key: KUDU-2051
> URL: https://issues.apache.org/jira/browse/KUDU-2051
> Project: Kudu
>  Issue Type: Task
>  Components: documentation
>Affects Versions: 1.4.0
>Reporter: Ambreen Kazi
>Assignee: Ambreen Kazi
>
> Remove timestamp and predicate evaluation for NULL/NOT NULL from list of 
> Impala/Kudu limitations (IMPALA-4859).
> https://kudu.apache.org/docs/kudu_impala_integration.html#_known_issues_and_limitations
> cc: [~tlipcon] - does that sound right? 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KUDU-2051) [DOCS] Update Impala integration limitations

2017-06-22 Thread Ambreen Kazi (JIRA)
Ambreen Kazi created KUDU-2051:
-------------------------------

 Summary: [DOCS] Update Impala integration limitations 
 Key: KUDU-2051
 URL: https://issues.apache.org/jira/browse/KUDU-2051
 Project: Kudu
  Issue Type: Task
  Components: documentation
Affects Versions: 1.4.0
Reporter: Ambreen Kazi
Assignee: Ambreen Kazi


Remove timestamp and predicate evaluation for NULL/NOT NULL from list of 
Impala/Kudu limitations.

https://kudu.apache.org/docs/kudu_impala_integration.html#_known_issues_and_limitations

cc: [~tlipcon] - does that sound right? 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2050) Avoid peer eviction during block manager startup

2017-06-22 Thread Dan Burkert (JIRA)

[ https://issues.apache.org/jira/browse/KUDU-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059889#comment-16059889 ]

Dan Burkert commented on KUDU-2050:
-----------------------------------

I don't think it's a good idea to treat a tablet that is de facto offline as 
bootstrapping. This has the potential to greatly increase the window in which 
a tablet is under-replicated. I think long term it's preferable to 
over-replicate to 4 replicas, then kill off one when the node finally comes back.

> Avoid peer eviction during block manager startup
> ------------------------------------------------
>
> Key: KUDU-2050
> URL: https://issues.apache.org/jira/browse/KUDU-2050
> Project: Kudu
>  Issue Type: Bug
>  Components: fs, tserver
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Priority: Critical
>
> In larger deployments we've observed that opening the block manager can take 
> a really long time, like tens of minutes or sometimes even hours. This is 
> especially true as of 1.4 where the log block manager tries to optimize 
> on-disk data structures during startup.
> The default time to Raft peer eviction is 5 minutes. If one node is restarted 
> and LBM startup takes over 5 minutes, or if all nodes are restarted and 
> there's over 5 minutes of LBM startup time variance across them, the "slow" 
> node could have all of its replicas evicted. Besides generating a lot of 
> unnecessary work in rereplication, this effectively "defeats" the LBM 
> optimizations in that it would have been equally slow (but more efficient) to 
> reformat the node instead.
> So, let's reorder startup such that LBM startup counts towards replica 
> bootstrapping. One idea: adjust FsManager startup so that tablet-meta/cmeta 
> files can be accessed early to construct bootstrapping replicas, but to defer 
> opening of the block manager until after that time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KUDU-2050) Avoid peer eviction during block manager startup

2017-06-22 Thread Adar Dembo (JIRA)
Adar Dembo created KUDU-2050:


 Summary: Avoid peer eviction during block manager startup
 Key: KUDU-2050
 URL: https://issues.apache.org/jira/browse/KUDU-2050
 Project: Kudu
  Issue Type: Bug
  Components: fs, tserver
Affects Versions: 1.4.0
Reporter: Adar Dembo
Priority: Critical


In larger deployments we've observed that opening the block manager can take a 
really long time, like tens of minutes or sometimes even hours. This is 
especially true as of 1.4 where the log block manager tries to optimize on-disk 
data structures during startup.

The default time to Raft peer eviction is 5 minutes. If one node is restarted 
and LBM startup takes over 5 minutes, or if all nodes are restarted and there's 
over 5 minutes of LBM startup time variance across them, the "slow" node could 
have all of its replicas evicted. Besides generating a lot of unnecessary work 
in rereplication, this effectively "defeats" the LBM optimizations in that it 
would have been equally slow (but more efficient) to reformat the node instead.

So, let's reorder startup such that LBM startup counts towards replica 
bootstrapping. One idea: adjust FsManager startup so that tablet-meta/cmeta 
files can be accessed early to construct bootstrapping replicas, but to defer 
opening of the block manager until after that time.
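
As an aside, one knob behind that 5-minute default is the consensus flag 
--follower_unavailable_considered_failed_sec (300 by default). A hedged, 
illustrative stopgap for planned restarts, not part of this ticket's proposal:
{noformat}
# Allow two hours of unavailability before followers on a restarting node
# are considered failed and evicted.
kudu-tserver --follower_unavailable_considered_failed_sec=7200 <other flags>
{noformat}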



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (KUDU-2049) Too-strict check on RLE-encoded integer columns causes crash on scan

2017-06-22 Thread Will Berkeley (JIRA)

[ https://issues.apache.org/jira/browse/KUDU-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Will Berkeley reassigned KUDU-2049:
-----------------------------------

Assignee: Will Berkeley

> Too-strict check on RLE-encoded integer columns causes crash on scan
> ---------------------------------------------------------------------
>
> Key: KUDU-2049
> URL: https://issues.apache.org/jira/browse/KUDU-2049
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: Will Berkeley
>Assignee: Will Berkeley
>
> Sometimes scans of RLE-encoded integer columns cause CHECK failures due to 
> the CHECK condition being too strict:
> {{Check failed: pos < num_elems_ (128 vs. 128) Tried to seek to 128 which is 
> >= number of elements (128) in the block!}}
> It's valid to scan just past the number of elements in the block, though.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KUDU-2049) Too-strict check on RLE-encoded integer columns causes crash on scan

2017-06-22 Thread Will Berkeley (JIRA)
Will Berkeley created KUDU-2049:
--------------------------------

 Summary: Too-strict check on RLE-encoded integer columns causes 
crash on scan
 Key: KUDU-2049
 URL: https://issues.apache.org/jira/browse/KUDU-2049
 Project: Kudu
  Issue Type: Bug
Affects Versions: 1.4.0
Reporter: Will Berkeley


Sometimes scans of RLE-encoded integer columns cause CHECK failures due to the 
CHECK condition being too strict:

{{Check failed: pos < num_elems_ (128 vs. 128) Tried to seek to 128 which is >= 
number of elements (128) in the block!}}

It's valid to scan just past the number of elements in the block, though.
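
A minimal sketch of the relaxed bound (hypothetical names modeled on the CHECK 
message; this is not Kudu's actual decoder code):
{noformat}
#include <cassert>
#include <cstddef>

// pos == num_elems is a valid "one past the end" position, analogous to an
// end() iterator, so the bound must be <=. The too-strict form
// `pos < num_elems` is exactly what fired as "128 vs. 128" above.
void SeekToPositionInBlock(size_t pos, size_t num_elems) {
  assert(pos <= num_elems && "tried to seek past the end of the block");
  // ... position the RLE decoder at element `pos` ...
}
{noformat}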



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)