[jira] [Commented] (KUDU-2165) alter_table-test has mysterious TSAN warnings

2017-11-20 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260315#comment-16260315
 ] 

Todd Lipcon commented on KUDU-2165:
---

I think this is just because the cache is a singleton, and each 
internal-minicluster daemon ends up calling StartInstrumentation. So, when 
multiple daemons start, they all race to set metrics and end up overwriting 
the metrics set by other daemons. Additionally, if you restart a server while 
another one is actively working, you'll end up destructing the 
CacheMetrics underneath a concurrently operating cache, causing the race.
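
To make the pattern concrete, here is a minimal single-threaded sketch of a process-wide cache whose metrics get replaced by whichever daemon started last. Every name in it (Metrics, SharedCache, StartDaemon) is a hypothetical stand-in, not Kudu's actual code; it only shows why a second start destroys the object the first daemon's cache users may still be touching.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration; these are not Kudu's classes.
struct Metrics {
  explicit Metrics(std::string owner_name) : owner(std::move(owner_name)) {}
  std::string owner;
  long lookups = 0;
};

class SharedCache {
 public:
  // One process-wide instance, shared by every daemon in the test process.
  static SharedCache& Get() {
    static SharedCache instance;
    return instance;
  }

  // Replaces, and therefore destroys, whatever metrics were installed before.
  // Deliberately unsynchronized to mirror the pattern being described.
  void SetMetrics(std::unique_ptr<Metrics> m) { metrics_ = std::move(m); }

  // A concurrent caller would dereference metrics_ here with no lock held,
  // which is the access that would race with SetMetrics().
  void Lookup() {
    if (metrics_) metrics_->lookups++;
  }

  const Metrics* metrics() const { return metrics_.get(); }

 private:
  SharedCache() = default;
  std::unique_ptr<Metrics> metrics_;
};

// Each simulated daemon installs its own metrics on startup.
void StartDaemon(const std::string& name) {
  assert(!name.empty());
  SharedCache::Get().SetMetrics(std::make_unique<Metrics>(name));
}
```

Calling StartDaemon("master") and then StartDaemon("tserver") leaves only the second daemon's metrics installed; in a multi-threaded process the same replacement could happen while another thread is inside Lookup().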

> alter_table-test has mysterious TSAN warnings
> -
>
> Key: KUDU-2165
> URL: https://issues.apache.org/jira/browse/KUDU-2165
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Adar Dembo
> Attachments: alter_table-test.txt
>
>
> I say "mysterious" because the warnings don't show one half of the stack 
> trace, so it's not immediately clear where the race is.
> Here are the warnings. I'm attaching the full log too.
> {noformat}
> ==
> WARNING: ThreadSanitizer: data race (pid=1945)
>   Write of size 8 at 0x7b0c000913b0 by main thread:
> #0 operator delete(void*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-4.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_new_delete.cc:73
>  (alter_table-test+0x4db471)
> #1 kudu::AtomicGauge::~AtomicGauge() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/util/metrics.h:768:7 
> (libtablet.so+0x15a561)
> #2 kudu::RefCountedThreadSafe kudu::DefaultRefCountedThreadSafeTraits 
> >::DeleteInternal(kudu::Metric const*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/gutil/ref_counted.h:153:44
>  (libmaster.so+0xc1787)
> #3 
> kudu::DefaultRefCountedThreadSafeTraits::Destruct(kudu::Metric 
> const*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/gutil/ref_counted.h:116:5
>  (libmaster.so+0xc1749)
> #4 kudu::RefCountedThreadSafe kudu::DefaultRefCountedThreadSafeTraits >::Release() const 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/gutil/ref_counted.h:144:7
>  (libmaster.so+0xc1709)
> #5 scoped_refptr::~scoped_refptr() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/gutil/ref_counted.h:266:13
>  (libtablet.so+0x1596aa)
> #6 kudu::CacheMetrics::~CacheMetrics() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/util/cache_metrics.h:27:8
>  (libkudu_util.so+0xe8cda)
> #7 
> kudu::DefaultDeleter::operator()(kudu::CacheMetrics*) 
> const 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/gutil/gscoped_ptr.h:145:5
>  (libkudu_util.so+0xe8c9e)
> #8 kudu::internal::gscoped_ptr_impl kudu::DefaultDeleter >::reset(kudu::CacheMetrics*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/gutil/gscoped_ptr.h:254:7
>  (libkudu_util.so+0xe9404)
> #9 gscoped_ptr kudu::DefaultDeleter >::reset(kudu::CacheMetrics*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/gutil/gscoped_ptr.h:375:46
>  (libkudu_util.so+0xe9370)
> #10 kudu::(anonymous 
> namespace)::ShardedLRUCache::SetMetrics(scoped_refptr 
> const&) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/util/cache.cc:464:14 
> (libkudu_util.so+0xe6840)
> #11 
> kudu::cfile::BlockCache::StartInstrumentation(scoped_refptr
>  const&) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/cfile/block_cache.cc:96:11
>  (libcfile.so+0x72af4)
> #12 kudu::master::Master::Init() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/master/master.cc:118:38 
> (libmaster.so+0xe5ff6)
> #13 kudu::master::MiniMaster::Start() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/master/mini_master.cc:92:3
>  (libmaster.so+0x10070e)
> #14 kudu::cluster::InternalMiniCluster::StartSingleMaster() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/mini-cluster/internal_mini_cluster.cc:173:3
>  (libmini-cluster.so+0x3f0cf)
> #15 kudu::cluster::InternalMiniCluster::Start() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/mini-cluster/internal_mini_cluster.cc:102:5
>  (libmini-cluster.so+0x3e03b)
> #16 kudu::AlterTableTest::SetUp() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/integration-tests/alter_table-test.cc:137:5
>  (alter_table-test+0x512f9f)
> #17 void 
> testing::internal::HandleSehExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.8.0/googletest/src/gtest.cc:2402:10
>  (libgmock.so+0x52b39)
> #18 void 
> testing::internal::HandleExceptionsInMethodIfSupported void>(testing::Test*, void 

[jira] [Resolved] (KUDU-2221) Improve server startup error message when glog files have the wrong ACLs

2017-11-20 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-2221.
--
   Resolution: Duplicate
Fix Version/s: n/a

Oh man, you are right :)

> Improve server startup error message when glog files have the wrong ACLs
> 
>
> Key: KUDU-2221
> URL: https://issues.apache.org/jira/browse/KUDU-2221
> Project: Kudu
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.4.0
>Reporter: Jean-Daniel Cryans
> Fix For: n/a
>
>
> On a server where the glog files were played with as "root", the master 
> refused to start due to:
> {noformat}
> master_main.cc:68] Check failed: _s.ok() Bad status: IO error: Unable to 
> delete excess log files: glob failure: 2
> {noformat}
> The existing log files belonged to root instead of kudu; chown'ing them fixed 
> the issue, but this message could be made easier to parse.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2221) Improve server startup error message when glog files have the wrong ACLs

2017-11-20 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260156#comment-16260156
 ] 

Todd Lipcon commented on KUDU-2221:
---

I think I already fixed this as KUDU-2205. Resolve as dup?

> Improve server startup error message when glog files have the wrong ACLs
> 
>
> Key: KUDU-2221
> URL: https://issues.apache.org/jira/browse/KUDU-2221
> Project: Kudu
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.4.0
>Reporter: Jean-Daniel Cryans
>
> On a server where the glog files were played with as "root", the master 
> refused to start due to:
> {noformat}
> master_main.cc:68] Check failed: _s.ok() Bad status: IO error: Unable to 
> delete excess log files: glob failure: 2
> {noformat}
> The existing log files belonged to root instead of kudu; chown'ing them fixed 
> the issue, but this message could be made easier to parse.





[jira] [Comment Edited] (KUDU-2218) SSL3_WRITE_PENDING TlsSocket error

2017-11-20 Thread Alexey Serbin (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260112#comment-16260112
 ] 

Alexey Serbin edited comment on KUDU-2218 at 11/21/17 1:06 AM:
---

The patches for back-porting those two fixes into branch-1.[3-5].x branches are 
ready for review:

https://gerrit.cloudera.org/#/c/8599/
https://gerrit.cloudera.org/#/c/8600/
https://gerrit.cloudera.org/#/c/8601/
https://gerrit.cloudera.org/#/c/8602/
https://gerrit.cloudera.org/#/c/8603/
https://gerrit.cloudera.org/#/c/8604/



was (Author: aserbin):
The patches for back-porting those two fixes are ready for review:

https://gerrit.cloudera.org/#/c/8599/
https://gerrit.cloudera.org/#/c/8600/
https://gerrit.cloudera.org/#/c/8601/
https://gerrit.cloudera.org/#/c/8602/
https://gerrit.cloudera.org/#/c/8603/
https://gerrit.cloudera.org/#/c/8604/


> SSL3_WRITE_PENDING TlsSocket error
> --
>
> Key: KUDU-2218
> URL: https://issues.apache.org/jira/browse/KUDU-2218
> Project: Kudu
>  Issue Type: Bug
>  Components: rpc, security
> Environment: TSAN builds
>Reporter: Alexey Serbin
>Assignee: Todd Lipcon
> Fix For: 1.6.0
>
>
> The {{RaftConsensusITest.TestLargeBatches}} scenario is flaky when running a 
> binary built in the TSAN configuration.
> In most cases, the test fails with an error message like the following:
> {noformat}
> W1114 04:17:44.407441  1773 connection.cc:657] client connection to 
> 127.1.147.67:45133 send error: Network error: failed to write to TLS socket: 
> error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry:s3_pkt.c:826
> W1114 04:17:44.407811  1773 consensus_peers.cc:411] T 
> bc17135f3b5643dda691863392bda6a3 P 9e51dc8b039e4a599c7b77e5ac6b48fe -> Peer 
> 05a8069cf7a84f918864b5cb4f76f056 (127.1.147.67:45133): Couldn't send request 
> to peer 05a8069cf7a84f918864b5cb4f76f056 for tablet 
> bc17135f3b5643dda691863392bda6a3. Status: Network error: failed to write to 
> TLS socket: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write 
> retry:s3_pkt.c:826. Retrying in the next heartbeat period. Already tried 1 
> times.
> {noformat}





[jira] [Commented] (KUDU-2218) SSL3_WRITE_PENDING TlsSocket error

2017-11-20 Thread Alexey Serbin (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260112#comment-16260112
 ] 

Alexey Serbin commented on KUDU-2218:
-

The patches for back-porting those two fixes are ready for review:

https://gerrit.cloudera.org/#/c/8599/
https://gerrit.cloudera.org/#/c/8600/
https://gerrit.cloudera.org/#/c/8601/
https://gerrit.cloudera.org/#/c/8602/
https://gerrit.cloudera.org/#/c/8603/
https://gerrit.cloudera.org/#/c/8604/







[jira] [Resolved] (KUDU-1078) Under heavy load, log cache reads return "Op in future"

2017-11-20 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-1078.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Fixed in a8a7773cf28f0069b691d412f9dd23e72e3833fa

> Under heavy load, log cache reads return "Op in future"
> ---
>
> Key: KUDU-1078
> URL: https://issues.apache.org/jira/browse/KUDU-1078
> Project: Kudu
>  Issue Type: Improvement
>  Components: consensus
>Affects Versions: Private Beta
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: 1.6.0
>
>
> JD accidentally configured botl80 so that all nodes were writing to a single 
> tablet (talk about a stress test!) I see the following warning occasionally 
> in the leader logs:
> W0827 11:34:04.051275 55950 consensus_queue.cc:272] T 
> d6ff74fb04454712873abd0d2328e59b P 668314723deb4a818cc9a43eba51073c [LEADER]: 
> The logs necessary to catch up peer 600e62435ca24425b29c00d9726b78be have 
> been garbage collected. The follower will never be able to catch up (Not 
> found: op in future). Instructing remote peer to remotely bootstrap.
> Calling this a blocker because, if we couple this with the "automatically 
> re-bootstrap nodes if they fall behind", we'll be accidentally deleting 
> tablets when this happens (no good!)





[jira] [Reopened] (KUDU-2218) SSL3_WRITE_PENDING TlsSocket error

2017-11-20 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reopened KUDU-2218:
---

[~aserbin] do you think we should backport this to earlier branches as well? 
(along with the other WANT_RETRY-related fix?)






[jira] [Created] (KUDU-2222) IO error in UpdateScanDeltaCompactionTest ASAN build

2017-11-20 Thread Andrew Wong (JIRA)
Andrew Wong created KUDU-2222:
-

 Summary: IO error in UpdateScanDeltaCompactionTest ASAN build
 Key: KUDU-2222
 URL: https://issues.apache.org/jira/browse/KUDU-2222
 Project: Kudu
  Issue Type: Bug
  Components: test
Reporter: Andrew Wong
 Attachments: update_scan_delta_compact-test.txt

In a few ASAN builds, update_scan_delta_compact-test fails with the message:

{{update_scan_delta_compact-test.cc:253] Check failed: _s.ok() Bad status: IO 
error: Some errors occurred}}

A nearby warning complains about an RPC timeout, which may point to a stressed 
execution environment. Regardless, it's not entirely obvious whether that's the 
only cause of failure given the generic error message.





[jira] [Resolved] (KUDU-2215) kernel stack watchdog can delay thread exits

2017-11-20 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2215.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> kernel stack watchdog can delay thread exits
> 
>
> Key: KUDU-2215
> URL: https://issues.apache.org/jira/browse/KUDU-2215
> Project: Kudu
>  Issue Type: Bug
>  Components: perf, util
>Affects Versions: 1.6.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: 1.6.0
>
>
> KernelStackWatchdog::Unregister can block if the watchdog is currently in the 
> process of dumping a thread's stacks. The stack dumping can take several 
> seconds in the worst case, so this can delay another thread from exiting.
> There's currently a comment indicating that we don't usually care about 
> delaying thread exits, but there are actually a few places where we do join() 
> on a thread. In particular, in earlier versions Peer::Close() joins on a 
> ResettableHeartbeater and thus can get stuck for a while.





[jira] [Created] (KUDU-2221) Improve server startup error message when glog files have the wrong ACLs

2017-11-20 Thread Jean-Daniel Cryans (JIRA)
Jean-Daniel Cryans created KUDU-2221:


 Summary: Improve server startup error message when glog files have 
the wrong ACLs
 Key: KUDU-2221
 URL: https://issues.apache.org/jira/browse/KUDU-2221
 Project: Kudu
  Issue Type: Improvement
  Components: server
Affects Versions: 1.4.0
Reporter: Jean-Daniel Cryans


On a server where the glog files were played with as "root", the master refused 
to start due to:

{noformat}
master_main.cc:68] Check failed: _s.ok() Bad status: IO error: Unable to delete 
excess log files: glob failure: 2
{noformat}

The existing log files belonged to root instead of kudu; chown'ing them fixed 
the issue, but this message could be made easier to parse.
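
As an illustration of what "easier to parse" could look like, here is a rough sketch that maps the numeric glob(3) return code to text. GlobErrorToString and DescribeGlobFailure are hypothetical helpers, not existing Kudu functions, and the wording of the messages is only a suggestion. The constants come from POSIX <glob.h>; on glibc, as far as I know, the value 2 in the message above corresponds to GLOB_ABORTED (a read error, such as an unreadable file or directory).

```cpp
#include <cassert>
#include <glob.h>
#include <string>

// Hypothetical helper, not part of Kudu: turn a glob(3) return code into text.
std::string GlobErrorToString(int rc) {
  if (rc == 0) return "success";
  if (rc == GLOB_NOSPACE) return "out of memory (GLOB_NOSPACE)";
  if (rc == GLOB_ABORTED) return "read error, check permissions (GLOB_ABORTED)";
  if (rc == GLOB_NOMATCH) return "no matches found (GLOB_NOMATCH)";
  return "unknown glob error " + std::to_string(rc);
}

// Hypothetical helper: build a failure message that names the pattern and
// spells out the cause instead of printing a bare number.
std::string DescribeGlobFailure(const std::string& pattern, int rc) {
  assert(rc != 0);  // only meant to be called for failures
  return "glob failure on '" + pattern + "': " + GlobErrorToString(rc);
}
```

With something along these lines, the startup failure would name the glob pattern and hint at permissions rather than ending in "glob failure: 2".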





[jira] [Commented] (KUDU-2220) GetEndOfChainX509 does not return end-user cert

2017-11-20 Thread Sailesh Mukil (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259553#comment-16259553
 ] 

Sailesh Mukil commented on KUDU-2220:
-

Patch out for review:
https://gerrit.cloudera.org/#/c/8595/

> GetEndOfChainX509 does not return end-user cert
> ---
>
> Key: KUDU-2220
> URL: https://issues.apache.org/jira/browse/KUDU-2220
> Project: Kudu
>  Issue Type: Bug
>  Components: security
>Affects Versions: 1.5.0
>Reporter: Sailesh Mukil
>Assignee: Sailesh Mukil
>
> KUDU-2091 introduced a function GetEndOfChainX509() which was supposed to 
> return the "end-user" certificate. However, the end-user certificate is not 
> at the end of the chain, but rather at the beginning of the chain as 
> specified by the RFC:
> https://tools.ietf.org/html/rfc5246#section-7.4.2
> {quote}This is a sequence (chain) of certificates.  The sender's certificate 
> MUST come first in the list.  Each following certificate MUST directly 
> certify the one preceding it.{quote}
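
A small sketch of the ordering rule quoted above, using a hypothetical Cert struct in place of a real X.509 object (this is not the GetEndOfChainX509() implementation, and the helper names are made up for illustration). It shows that under the RFC's ordering the end-user certificate is the first element of the chain, not the last.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed X.509 certificate.
struct Cert {
  std::string subject;
  std::string issuer;
};

// Per RFC 5246 section 7.4.2, the sender's certificate comes first.
const Cert& EndUserCert(const std::vector<Cert>& chain) {
  assert(!chain.empty());
  return chain.front();
}

// Checks the ordering rule: each following certificate must certify the one
// preceding it, i.e. the previous entry's issuer is the next entry's subject.
bool IsOrderedChain(const std::vector<Cert>& chain) {
  for (std::size_t i = 1; i < chain.size(); ++i) {
    if (chain[i - 1].issuer != chain[i].subject) return false;
  }
  return true;
}
```

For a chain of {server, intermediate}, EndUserCert() returns the server certificate, while taking the last element would return the intermediate CA certificate instead.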


