[jira] [Resolved] (KUDU-639) Leader doesn't overwrite demoted follower's log properly

2020-06-18 Thread Mike Percy (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy resolved KUDU-639.
-
Resolution: Fixed

> Leader doesn't overwrite demoted follower's log properly
> 
>
> Key: KUDU-639
> URL: https://issues.apache.org/jira/browse/KUDU-639
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: M4.5
>Reporter: David Alves
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: M5
>
>
> We just ran into this situation in the YCSB cluster, which is apparently a 
> log divergence.
> We have nodes a, b, c (corresponding to nodes 
> 33c8fb1dc4434df0938ccc27ecfd58a1/a1219, 
> 4ed2e09f80e04d198edeb53e15b3539e/a1220, 
> ab8ed89f9041495a95b8d2b77591c9d7/a1215).
> Node a is the leader for term 3, then times out.
> Node b is elected leader for term 5 with votes from b and c.
> When b is elected leader, the log state is:
> State: All replicated op: 3.6546, Majority replicated op: 3.6533, Committed 
> index: 3.6533, Last appended: 3.6546, Current term: 5
> b never actually replicates anything and eventually loses leadership to node 
> a, again.
> When b loses leadership, its WAL is in the following state:
> State: All replicated op: 0.0, Majority replicated op: 3.6533, Committed 
> index: 3.6533, Last appended: 5.6547, Current term: 5
> That is, b appended a message in term 5 but never actually got to commit it.
> However, if we look at b's log we find a message in term 5 committed:
> 3.6546@99404  REPLICATE WRITE_OP
> COMMIT 3.6533
> 5.6547@99789  REPLICATE CHANGE_CONFIG_OP
> COMMIT 3.6535
> COMMIT 3.6536
> COMMIT 3.6537
> COMMIT 3.6538
> COMMIT 3.6534
> COMMIT 3.6541
> COMMIT 3.6540
> COMMIT 3.6543
> COMMIT 3.6542
> COMMIT 3.6545
> COMMIT 3.6546
> COMMIT 3.6544
> COMMIT 3.6539
> COMMIT 5.6547
> 3.6548@99430  REPLICATE WRITE_OP
> 6.6549@99795  REPLICATE CHANGE_CONFIG_OP
> And more problematically, that diverges from the other two nodes' logs:
> 3.6546@99404  REPLICATE WRITE_OP
> COMMIT 3.6533
> COMMIT 3.6536
> COMMIT 3.6537
> COMMIT 3.6535
> COMMIT 3.6539
> COMMIT 3.6538
> COMMIT 3.6534
> COMMIT 3.6541
> COMMIT 3.6540
> COMMIT 3.6543
> COMMIT 3.6542
> COMMIT 3.6544
> 3.6547@99429  REPLICATE WRITE_OP
> 3.6548@99430  REPLICATE WRITE_OP
> 6.6549@99795  REPLICATE CHANGE_CONFIG_OP
> 6.6550@99878  REPLICATE WRITE_OP
> 6.6551@99879  REPLICATE WRITE_OP
> 6.6552@99880  REPLICATE WRITE_OP
> COMMIT 3.6545
> COMMIT 3.6548
> COMMIT 3.6547
> COMMIT 3.6546
> COMMIT 6.6549
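
To make the expected behavior concrete, here is a minimal sketch of the Raft
log-matching rule at play (an illustration only, not Kudu's implementation):
when a leads again and replicates to the demoted follower b, b must truncate
its orphaned 5.6547 entry and accept the leader's entries in its place.

{code:python}
def append_entries(log, prev_term, prev_index, entries):
    """log maps op index -> term. Returns True if the entries were accepted."""
    # Reject if the follower's log doesn't contain the leader's previous op;
    # the leader then backs up and retries with an earlier prev op.
    if prev_index > 0 and log.get(prev_index) != prev_term:
        return False
    # Truncate everything past prev_index before appending: this is what
    # should overwrite b's orphaned 5.6547 rather than leave it in place.
    for idx in [i for i in log if i > prev_index]:
        del log[idx]
    log.update(dict(entries))
    return True

# b's log tail after losing leadership in term 5:
b = {6545: 3, 6546: 3, 6547: 5}
# The restored leader sends the ops that follow its 3.6546:
assert append_entries(b, 3, 6546, [(6547, 3), (6548, 3), (6549, 6)])
assert b == {6545: 3, 6546: 3, 6547: 3, 6548: 3, 6549: 6}
{code}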



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-639) Leader doesn't overwrite demoted follower's log properly

2020-06-18 Thread Mike Percy (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-639:

Fix Version/s: M5

This was fixed in 2015. Please file a separate Jira to track the task if it 
seems likely someone will add a test for this.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-2870) Checksum scan fails with "Not authorized" error when authz enabled

2019-06-19 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2870:


 Summary: Checksum scan fails with "Not authorized" error when 
authz enabled
 Key: KUDU-2870
 URL: https://issues.apache.org/jira/browse/KUDU-2870
 Project: Kudu
  Issue Type: Bug
  Components: authz
Affects Versions: 1.10.0
Reporter: Mike Percy


While testing a Kudu 1.10.0 RC build with authorization enabled, I tried a 
checksum scan and it failed:
{code:java}
[mpercy@mpercy-c63s-0619-1 ~]$ kudu cluster ksck 
mpercy-c63s-0619-1.vpc.cloudera.com 
-tables=default.loadgen_auto_b527de07b2d842f3a3c82c5f85eb2854 -checksum_scan 
-sections=CHECKSUM_RESULTS
Checksum finished in 0s: 0/8 replicas remaining (0B from disk, 0 rows summed)
Checksum Summary
---
default.loadgen_auto_b527de07b2d842f3a3c82c5f85eb2854
---
T 09d0df0ca48c41bf94c6a3a03533b811 P 7d31913f6bbf4355a974c76e4f82c72a 
(mpercy-c63s-0619-4.vpc.cloudera.com:7050): Error: Remote error: Not 
authorized: no authorization token presented
T 16cafc1e2e814b5fb988b22554ac306b P 11edb01b3b184a2da8586fa5cffda90c 
(mpercy-c63s-0619-3.vpc.cloudera.com:7050): Error: Remote error: Not 
authorized: no authorization token presented
T 37d40c90b0614b5d9515d1458e31657c P de47be31840f4b349f970cf759097cec 
(mpercy-c63s-0619-2.vpc.cloudera.com:7050): Error: Remote error: Not 
authorized: no authorization token presented
T 5da264a95d31474ea4b0b2e464a5b261 P de47be31840f4b349f970cf759097cec 
(mpercy-c63s-0619-2.vpc.cloudera.com:7050): Error: Remote error: Not 
authorized: no authorization token presented
T 6acad06c945942b5af696f7f59b4d2ea P 7d31913f6bbf4355a974c76e4f82c72a 
(mpercy-c63s-0619-4.vpc.cloudera.com:7050): Error: Remote error: Not 
authorized: no authorization token presented
T 949f20e82db1467fa9f968853c901f11 P 11edb01b3b184a2da8586fa5cffda90c 
(mpercy-c63s-0619-3.vpc.cloudera.com:7050): Error: Remote error: Not 
authorized: no authorization token presented
T 9eda0d8c267c44efb9d76cf8fb911f93 P 921f4e7e28274a9189c978162d604f2e 
(mpercy-c63s-0619-5.vpc.cloudera.com:7050): Error: Remote error: Not 
authorized: no authorization token presented
T a1ec4565447c4628baa6a1c5f9765c7a P 11edb01b3b184a2da8586fa5cffda90c 
(mpercy-c63s-0619-3.vpc.cloudera.com:7050): Error: Remote error: Not 
authorized: no authorization token presented

==
Warnings:
==
Some masters have unsafe, experimental, or hidden flags set
Some tablet servers have unsafe, experimental, or hidden flags set

==
Errors:
==
Aborted: checksum scan error: 8 errors were detected

FAILED
Runtime error: ksck discovered errors{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-1575) Backup and restore procedures

2019-06-10 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy resolved KUDU-1575.
--
   Resolution: Fixed
Fix Version/s: 1.10.0

Incremental backup / restore made it into 1.10.0. This still needs to be 
documented.

> Backup and restore procedures
> -
>
> Key: KUDU-1575
> URL: https://issues.apache.org/jira/browse/KUDU-1575
> Project: Kudu
>  Issue Type: Improvement
>  Components: master, tserver
>Reporter: Mike Percy
>Assignee: Mike Percy
>Priority: Major
>  Labels: backup
> Fix For: 1.10.0
>
>
> Kudu needs backup and restore procedures, both for data and for metadata.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2832) Clean up after a failed restore job

2019-05-30 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2832:
-
Component/s: backup

> Clean up after a failed restore job
> ---
>
> Key: KUDU-2832
> URL: https://issues.apache.org/jira/browse/KUDU-2832
> Project: Kudu
>  Issue Type: Improvement
>  Components: backup
>Reporter: Will Berkeley
>Priority: Major
>
> If a restore job fails, it may leave a partially-restored table on the 
> destination cluster. This will prevent a naive retry from succeeding. We 
> should make more effort to clean up if a restore job fails, so that a simple 
> retry of the same job might be able to succeed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2827) Backup should tombstone dropped tables

2019-05-28 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16850236#comment-16850236
 ] 

Mike Percy commented on KUDU-2827:
--

To add a little more context to this Jira: the implication is that we should 
have a way to determine whether a table was dropped or renamed, which would 
likely require additional master RPC API support, since we would need to be 
able to look at the current state of a table id (table ids are used in the 
backup graph). The purpose is to properly handle dropped tables in the backup 
GC (backup cleanup) tool, now merged as part of 
[https://github.com/apache/kudu/commit/a5a8da655ca8f0088dcd39301bd9bd87e182c460]
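
As a rough sketch of what the GC tool's side of this could look like (all
names here are hypothetical, including the master lookup, which is exactly
the RPC support that does not exist yet):

{code:python}
import time

RETENTION_SECS = 30 * 24 * 3600  # assumed retention window, not a real default

def gc_pass(backup_graph, lookup_table_state, now=None):
    """lookup_table_state(table_id) -> 'EXISTS' or 'DROPPED' (assumed new RPC).

    backup_graph maps table_id -> metadata dict for that table's backups.
    Returns the table ids whose backups are now eligible for deletion.
    """
    now = time.time() if now is None else now
    for table_id, meta in backup_graph.items():
        # Tombstone (don't delete) on first sight of the drop.
        if 'tombstoned_at' not in meta and lookup_table_state(table_id) == 'DROPPED':
            meta['tombstoned_at'] = now
    # Tombstoned tables age out even though they are still on a restore path.
    return [tid for tid, meta in backup_graph.items()
            if 'tombstoned_at' in meta
            and now - meta['tombstoned_at'] >= RETENTION_SECS]
{code}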

> Backup should tombstone dropped tables
> --
>
> Key: KUDU-2827
> URL: https://issues.apache.org/jira/browse/KUDU-2827
> Project: Kudu
>  Issue Type: Task
>  Components: backup
>Reporter: Mike Percy
>Priority: Major
>
> It would be useful for backup to "tombstone" dropped tables so that the GC 
> process can detect this and eventually consider these eligible for deletion, 
> even though they are still on the restore path from a backup graph 
> perspective.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2827) Backup should tombstone dropped tables

2019-05-24 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2827:


 Summary: Backup should tombstone dropped tables
 Key: KUDU-2827
 URL: https://issues.apache.org/jira/browse/KUDU-2827
 Project: Kudu
  Issue Type: Task
  Components: backup
Reporter: Mike Percy


It would be useful for backup to "tombstone" dropped tables so that the GC 
process can detect this and eventually consider these eligible for deletion, 
even though they are still on the restore path from a backup graph perspective.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2810) Restore needs DELETE_IGNORE

2019-05-02 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832126#comment-16832126
 ] 

Mike Percy commented on KUDU-2810:
--

Another option – more of a workaround – would be to simply handle the Not Found 
error specifically in the Restore job.
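
A sketch of that workaround in client terms (illustrative only; the
{{NotFound}} handling would live in the restore job, while a real
DELETE_IGNORE operation would push this check server-side):

{code:python}
class NotFound(Exception):
    """Stands in for the client error raised when deleting a missing row."""

def apply_delete(delete_row, key):
    try:
        delete_row(key)
    except NotFound:
        # A previous attempt of this restore task already applied the
        # delete, so a retry can safely treat "row not found" as success.
        pass
{code}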

> Restore needs DELETE_IGNORE
> ---
>
> Key: KUDU-2810
> URL: https://issues.apache.org/jira/browse/KUDU-2810
> Project: Kudu
>  Issue Type: Bug
>  Components: backup
>Affects Versions: 1.9.0
>Reporter: Will Berkeley
>Priority: Major
> Fix For: 1.10.0
>
>
> If a restore task fails for any reason, and it's restoring an incremental 
> with DELETE row actions, when the task is retried it will fail any deletes 
> that happened on the previous task run. We need a DELETE_IGNORE write 
> operation to handle this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2809) Incremental backup / diff scan does not handle rows that are inserted and deleted between two incrementals correctly

2019-05-02 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832108#comment-16832108
 ] 

Mike Percy commented on KUDU-2809:
--

+1 on the correct solution here: the diff scan should not return the deleted 
row at all if the insert of the row was the first operation after the start 
timestamp of the diff scan and its end state was deleted.
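
A toy model of that rule (not Kudu's diff-scan code): a row whose first
mutation after the start timestamp is an INSERT and whose final state is
deleted contributes nothing to the diff.

{code:python}
def diff_rows(history, start_ts, end_ts):
    """history: key -> time-ordered [(ts, op)], op in {'INSERT','UPDATE','DELETE'}."""
    out = {}
    for key, muts in history.items():
        window = [(ts, op) for ts, op in muts if start_ts < ts <= end_ts]
        if not window:
            continue
        if window[0][1] == 'INSERT' and window[-1][1] == 'DELETE':
            continue  # row was born and died inside the window: emit nothing
        out[key] = window[-1][1]
    return out

# A row inserted (step 4) and deleted (step 5) between the two backups
# produces no stray DELETE in the incremental:
assert diff_rows({1: [(5, 'INSERT'), (7, 'DELETE')]}, start_ts=4, end_ts=10) == {}
{code}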

> Incremental backup / diff scan does not handle rows that are inserted and 
> deleted between two incrementals correctly
> 
>
> Key: KUDU-2809
> URL: https://issues.apache.org/jira/browse/KUDU-2809
> Project: Kudu
>  Issue Type: Bug
>  Components: backup
>Affects Versions: 1.9.0
>Reporter: Will Berkeley
>Priority: Major
>
> I did the following sequence of operations:
> # Insert 100 million rows
> # Update 1 out of every 11 rows
> # Make a full backup
> # Insert 100 million more rows, after the original rows in keyspace
> # Delete 1 out of every 23 rows
> # Make an incremental backup
> Restore failed to apply the incremental backup, failing with an error like
> {noformat}
> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to Kudu; 
> sample errors:
> {noformat}
> Due to another bug, there are no sample errors, but after hacking around that 
> bug, I found that the incremental contained a row with a DELETE action for a 
> key that is not present in the full backup. That's because the row was 
> inserted in step 4 and deleted in step 5, between backups.
> We could fix this by
> # Making diff scan not return a DELETE for such a row
> # Implementing and using DELETE IGNORE in the restore job



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2801) Support exact-match timestamp for restore

2019-04-24 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2801:


 Summary: Support exact-match timestamp for restore
 Key: KUDU-2801
 URL: https://issues.apache.org/jira/browse/KUDU-2801
 Project: Kudu
  Issue Type: Task
  Components: backup
Reporter: Mike Percy


If a user wants to restore a backup at a specific timestamp, we should allow 
for a flag to pass an exact-match timestamp instead of just an upper-bound 
timestamp.
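
A small sketch of the two semantics (the flag itself is hypothetical):
upper-bound restores the newest backup at or before the timestamp, while
exact-match fails unless a backup was taken at exactly that timestamp.

{code:python}
def pick_backup(backup_timestamps, ts, exact_match=False):
    if exact_match:
        if ts not in backup_timestamps:
            raise ValueError(f"no backup taken at exactly {ts}")
        return ts
    candidates = [t for t in backup_timestamps if t <= ts]
    if not candidates:
        raise ValueError(f"no backup at or before {ts}")
    return max(candidates)

assert pick_backup([100, 200, 300], 250) == 200        # upper-bound behavior
assert pick_backup([100, 200, 300], 200, True) == 200  # exact-match behavior
{code}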



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2797) Implement table size metrics

2019-04-23 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2797:


 Summary: Implement table size metrics
 Key: KUDU-2797
 URL: https://issues.apache.org/jira/browse/KUDU-2797
 Project: Kudu
  Issue Type: Task
  Components: master, metrics
Affects Versions: 1.8.0
Reporter: Mike Percy


It would be valuable to implement table size metrics for row count and byte 
size (pre-replication and post-replication). The master could aggregate these 
stats from the various partitions (tablets) and expose aggregated metrics for 
consumption by monitoring systems and dashboards. These same metrics would also 
be valuable to display on the web UI.
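
The aggregation itself would be simple; a minimal sketch, assuming the master
already receives per-tablet row counts, byte sizes, and replication factors:

{code:python}
def aggregate_table_metrics(tablet_stats):
    """tablet_stats: iterable of (table, rows, bytes, replication_factor)."""
    tables = {}
    for table, rows, nbytes, rf in tablet_stats:
        m = tables.setdefault(table, {'rows': 0, 'bytes': 0, 'replicated_bytes': 0})
        m['rows'] += rows                      # pre-replication row count
        m['bytes'] += nbytes                   # pre-replication (logical) size
        m['replicated_bytes'] += nbytes * rf   # post-replication disk footprint
    return tables

stats = [('t1', 1000, 1 << 20, 3), ('t1', 500, 1 << 19, 3)]
assert aggregate_table_metrics(stats)['t1']['rows'] == 1500
{code}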



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2692) Remove requirements for virtual columns to specify a read default and not be nullable

2019-04-23 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824423#comment-16824423
 ] 

Mike Percy commented on KUDU-2692:
--

This is low priority because all of the diff scan / incremental backup APIs are 
currently marked private; if we decided to make diff scan public this might be 
more important for usability.

> Remove requirements for virtual columns to specify a read default and not be 
> nullable
> -
>
> Key: KUDU-2692
> URL: https://issues.apache.org/jira/browse/KUDU-2692
> Project: Kudu
>  Issue Type: Improvement
>  Components: tablet
>Reporter: Mike Percy
>Priority: Minor
>  Labels: backup
>
> Virtual column types such as IS_DELETED currently require a read default to 
> be specified, in addition to not being allowed to be nullable. Consider 
> relaxing these requirements to improve the user experience when working with 
> virtual columns.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2678) [Backup] Ensure the restore job can load the data in order

2019-04-23 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy resolved KUDU-2678.
--
   Resolution: Won't Do
Fix Version/s: n/a

For now we'll close this out; we can reopen if we suspect things have changed 
based on flush / compaction performance.

> [Backup] Ensure the restore job can load the data in order
> --
>
> Key: KUDU-2678
> URL: https://issues.apache.org/jira/browse/KUDU-2678
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Grant Henke
>Assignee: Grant Henke
>Priority: Minor
>  Labels: backup
> Fix For: n/a
>
>
> We need to adjust the Spark backup and restore jobs to be sure that we are 
> loading the data in sorted order. Not only is this useful for performance 
> today, but we may want to support some server-side performance optimizations 
> in the future that depend on this. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2678) [Backup] Ensure the restore job can load the data in order

2019-04-23 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824418#comment-16824418
 ] 

Mike Percy commented on KUDU-2678:
--

Based on the results of scale testing by Will, this doesn't help performance 
overall.

> [Backup] Ensure the restore job can load the data in order
> --
>
> Key: KUDU-2678
> URL: https://issues.apache.org/jira/browse/KUDU-2678
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Grant Henke
>Assignee: Grant Henke
>Priority: Minor
>  Labels: backup
>
> We need to adjust the Spark backup and restore jobs to be sure that we are 
> loading the data in sorted order. Not only is this useful for performance 
> today, but we may want to support some server-side performance optimizations 
> in the future that depend on this. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2670) Split Spark jobs into more tasks and make scan operations more concurrent

2019-04-23 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2670:
-
Labels: performance  (was: backup performance)

> Split Spark jobs into more tasks and make scan operations more concurrent
> --
>
> Key: KUDU-2670
> URL: https://issues.apache.org/jira/browse/KUDU-2670
> Project: Kudu
>  Issue Type: Improvement
>  Components: java, spark
>Affects Versions: 1.8.0
>Reporter: yangz
>Priority: Major
>  Labels: performance
>
> This refers to KUDU-2437 (split a tablet into primary key ranges by size).
> We need a Java client implementation that supports splitting the tablet scan 
> operation.
> We suggest two new implementations for the Java client (a sketch of both 
> ideas follows below):
>  # A ConcurrentKuduScanner so that more scanners can read data at the same 
> time. This is useful when a scan matches only one row but the predicate does 
> not contain the primary key: we send many scanner requests yet only one row 
> comes back, and issuing those requests one by one is slow, so we need a 
> concurrent approach. In our tests on a 10G tablet this saves a lot of time on 
> a single machine.
>  # A way to split a scan into more Spark tasks. To do so, we fetch scan 
> tokens in two steps: first we ask the tserver for key ranges, then we use 
> those ranges to build more scan tokens. In our usage a tablet is 10G, but 
> each task processes only 1G of data, which gives better performance.
> These features have been running well for us for half a year, and we hope 
> they will be useful for the community.
>  
>  
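
Here is the sketch referenced above (the client calls are invented for
illustration; they do not exist in the Java client today):

{code:python}
from concurrent.futures import ThreadPoolExecutor

def scan_tablet_concurrently(client, table, chunk_bytes=1 << 30, workers=8):
    # Step 1: ask the tserver for primary-key ranges of roughly chunk_bytes
    # each (per KUDU-2437). split_key_ranges is a hypothetical call.
    ranges = client.split_key_ranges(table, chunk_bytes)
    # Step 2: build one scan token per range, so a 10G tablet becomes ~10
    # independently schedulable tasks. build_scan_token is hypothetical too.
    tokens = [client.build_scan_token(table, r) for r in ranges]
    # Run the scanners concurrently instead of one by one.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [row for rows in pool.map(client.scan, tokens) for row in rows]
{code}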



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2795) Prevent cascading failures by detecting that disks are full and rejecting attempts to add additional replicas to a tablet server

2019-04-23 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2795:


 Summary: Prevent cascading failures by detecting that disks are 
full and rejecting attempts to add additional replicas to a tablet server
 Key: KUDU-2795
 URL: https://issues.apache.org/jira/browse/KUDU-2795
 Project: Kudu
  Issue Type: Task
  Components: master, tserver
Affects Versions: 1.8.0
Reporter: Mike Percy


Over the weekend a case was reported where the tablet server disks were 
near-full across a Kudu cluster. One finally reached the tipping point and 
crashed because the WAL disk was out of space and a write failed. This caused a 
cascading failure because the replicas on that tablet server were re-replicated 
to the rest of the cluster nodes, pushing them beyond the tipping point and 
eventually the whole cluster crashed.

We could potentially prevent the cascading failure by detecting that a tablet 
server is nearly full and reject or prevent attempts to move additional 
replicas to that server while it is in the "yellow zone" of disk space 
availability, preferring under-replicated tablets over an unavailable cluster.
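
A minimal sketch of the proposed guard (the threshold here is an assumption,
not a proposed default):

{code:python}
YELLOW_ZONE_FREE_RATIO = 0.10  # assumed "yellow zone" cutoff

def pick_rereplication_target(tservers):
    """tservers: [(uuid, bytes_free, bytes_total)]. Returns a uuid or None."""
    ok = [(uuid, free) for uuid, free, total in tservers
          if total > 0 and free / total > YELLOW_ZONE_FREE_RATIO]
    if not ok:
        # Better to stay under-replicated than to push the remaining
        # servers past their own tipping points.
        return None
    return max(ok, key=lambda t: t[1])[0]  # prefer the most headroom

assert pick_rereplication_target([('a', 5, 100), ('b', 40, 100)]) == 'b'
assert pick_rereplication_target([('a', 5, 100)]) is None
{code}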



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2794) Document how to identify and deal with KUDU-2233 corruption

2019-04-23 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2794:


 Summary: Document how to identify and deal with KUDU-2233 
corruption
 Key: KUDU-2794
 URL: https://issues.apache.org/jira/browse/KUDU-2794
 Project: Kudu
  Issue Type: Task
  Components: documentation, tablet
Reporter: Mike Percy


Document how to identify and deal with KUDU-2233 corruption. This would benefit 
from a tool to detect KUDU-2233 corruption like the one discussed in KUDU-2793.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2793) Design a scan to detect KUDU-2233 corruption in a replica

2019-04-23 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2793:


 Summary: Design a scan to detect KUDU-2233 corruption in a replica
 Key: KUDU-2793
 URL: https://issues.apache.org/jira/browse/KUDU-2793
 Project: Kudu
  Issue Type: Task
  Components: tablet
Affects Versions: 1.8.0
Reporter: Mike Percy


We should design a scan to detect corruption in a replica as a result of 
KUDU-2233. This may simply be a checksum scan, which we already support, but 
that has not been verified.

Today, when compaction is triggered in a KUDU-2233 corrupted replica, the 
tablet server will crash with a CHECK error. Ideally, when this detection scan 
notices such a corruption, it would cause the corrupt local replica to enter a 
FAILED tablet state. However, causing a crash might also be acceptable in 
controlled scenarios.
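
If a checksum scan does turn out to suffice, the detection could look roughly
like this sketch: checksum each replica and flag the ones that disagree with
the majority, moving them to FAILED instead of letting a later compaction
crash the server.

{code:python}
from collections import Counter

def find_corrupt_replicas(replica_checksums):
    """replica_checksums: {replica_uuid: checksum}. Returns suspect uuids."""
    majority_checksum, _ = Counter(replica_checksums.values()).most_common(1)[0]
    return [uuid for uuid, cs in replica_checksums.items()
            if cs != majority_checksum]

# The divergent replica would be marked FAILED rather than CHECK-crashing:
assert find_corrupt_replicas({'a': 0xAB, 'b': 0xAB, 'c': 0xFF}) == ['c']
{code}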



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2792) Automatically retry failed bootstrap on tablets that failed to start due to disk space

2019-04-23 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2792:


 Summary: Automatically retry failed bootstrap on tablets that 
failed to start due to disk space
 Key: KUDU-2792
 URL: https://issues.apache.org/jira/browse/KUDU-2792
 Project: Kudu
  Issue Type: Task
  Components: tserver
Affects Versions: 1.8.0
Reporter: Mike Percy


If a tablet replica fails to bootstrap due to insufficient disk space to replay 
the WAL, it will remain in a state that looks like this in ksck, even if the 
user frees up disk space:

 
{code:java}
5edf82f0516b4897b3a7991a7e67d71c (host1.example.com:7050): not running [LEADER]
 State: FAILED
 Data state: TABLET_DATA_READY
 Last status: IO error: Failed log replay. Reason: Failed to open new log: 
Insufficient disk space to allocate 8388608 bytes under path 
/data/1/kudu/tablet/wal/wals/5807c5100e0d4522a66e32efbb29d57e/.kudutmp.newsegmentzGFKEg
 (7939936256 bytes available vs 19993874923 bytes reserved) (error 28)
{code}
Today, recovering from this requires a tablet server restart.

It should be possible for a tablet server (i.e. the TsTabletManager) to detect 
that the failure was temporary, not permanent, and retry the failed bootstrap 
later on when additional disk space has been freed. From a programming 
perspective, that may require dealing with some object lifecycle issues (i.e. 
not reusing the Tablet object from the failed bootstrap).
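
A sketch of that retry loop (names are invented, and the real TsTabletManager
would classify a Status rather than an errno):

{code:python}
import errno
import time

def bootstrap_with_retry(new_tablet, bootstrap, retry_interval_secs=60):
    while True:
        tablet = new_tablet()  # fresh object; don't reuse the failed attempt's
        try:
            return bootstrap(tablet)
        except OSError as e:
            if e.errno != errno.ENOSPC:
                raise  # permanent failure: leave the replica FAILED
            # Transient: wait for disk space to be freed, then retry instead
            # of requiring a tablet server restart.
            time.sleep(retry_interval_secs)
{code}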



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2783) ksck: indicate whether a tablet replica is recovering

2019-04-22 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2783:


 Summary: ksck: indicate whether a tablet replica is recovering
 Key: KUDU-2783
 URL: https://issues.apache.org/jira/browse/KUDU-2783
 Project: Kudu
  Issue Type: Task
  Components: ops-tooling
Reporter: Mike Percy


Got the following feedback from someone running Kudu.

Add an indicator to the ksck output showing whether a table or replica is 
getting better or not, potentially by looking at whether the replica is 
bootstrapping: something like ‘this one will retry’ vs. ‘this one is not 
trying anymore’.

One way to do this would be to treat certain tablet data state + tablet state 
combinations, such as INITIALIZING / BOOTSTRAPPING or COPYING, as recovering, 
and the rest of the bad ones as not making progress.
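
A sketch of that mapping (the state names are from ksck output; the exact set
of "recovering" combinations is a guess):

{code:python}
RECOVERING = {
    ('TABLET_DATA_READY', 'INITIALIZING'),
    ('TABLET_DATA_READY', 'BOOTSTRAPPING'),
    ('TABLET_DATA_COPYING', 'INITIALIZING'),
}

def replica_indicator(data_state, tablet_state):
    if (data_state, tablet_state) in RECOVERING:
        return 'this one will retry'
    return 'this one is not trying anymore'
{code}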



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2655) Add metrics for metadata directory I/O

2019-04-22 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823538#comment-16823538
 ] 

Mike Percy commented on KUDU-2655:
--

This would be equally useful for performance questions around consensus 
metadata flush, which happens for configuration changes, leader changes, and 
voting.

> Add metrics for metadata directory I/O
> --
>
> Key: KUDU-2655
> URL: https://issues.apache.org/jira/browse/KUDU-2655
> Project: Kudu
>  Issue Type: Improvement
>  Components: metrics
>Affects Versions: 1.8.0
>Reporter: Will Berkeley
>Assignee: Will Berkeley
>Priority: Major
>
> There are good metrics for block manager (data dir) and WAL operations, like 
> {{block_manager_total_bytes_written}}, {{block_manager_total_bytes_read}}, 
> {{log_bytes_logged}}, and the {{log_append_latency}} histogram. What we are 
> missing are metrics about the amount of metadata I/O. It'd be nice to add
> * metadata_bytes_read
> * metadata_bytes_written
> * latency histograms for bytes read and bytes written



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2782) Implement distributed tracing support in Kudu

2019-04-22 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2782:


 Summary: Implement distributed tracing support in Kudu
 Key: KUDU-2782
 URL: https://issues.apache.org/jira/browse/KUDU-2782
 Project: Kudu
  Issue Type: Task
  Components: ops-tooling
Reporter: Mike Percy


It would be useful to implement distributed tracing support in Kudu, especially 
something like OpenTracing support that we could use with Zipkin, Jaeger, 
DataDog, etc. Particularly useful would be auto-sampled and on-demand traces of 
write RPCs since that would help us identify slow nodes or hotspots in the 
replication group and troubleshoot performance and stability issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2019-03-26 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy reassigned KUDU-2727:


Assignee: Mike Percy

> Contention on the Raft consensus lock can cause tablet service queue overflows
> --
>
> Key: KUDU-2727
> URL: https://issues.apache.org/jira/browse/KUDU-2727
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Will Berkeley
>Assignee: Mike Percy
>Priority: Major
>
> Here's stacks illustrating the phenomenon:
> {noformat}
>   tids=[2201]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb4e68e kudu::consensus::Peer::SignalRequest()
> 0xb9c0df kudu::consensus::PeerManager::SignalRequest()
> 0xb8c178 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4515]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex()
> 0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask()
> 0xb54058 
> _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22185,22194,22193,22188,22187,22186]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb8bff8 
> kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()
> 0xaaaef9 kudu::tablet::TransactionDriver::ExecuteAsync()
> 0xaa3742 kudu::tablet::TabletReplica::SubmitWrite()
> 0x92812d kudu::tserver::TabletServiceImpl::Write()
>0x1e28f3c kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22192,22191]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
>0x1e13dec kudu::rpc::ResultTracker::TrackRpc()
>0x1e28ef5 kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4426]
> 0x379ba0f710 
>0x206d3d0 
>0x212fd25 google::protobuf::Message::SpaceUsedLong()
>0x211dee4 
> google::protobuf::internal::GeneratedMessageReflection::SpaceUsedLong()
> 0xb6658e kudu::consensus::LogCache::AppendOperations()
> 0xb5c539 kudu::consensus::PeerMessageQueue::AppendOperations()
> 0xb5c7c7 kudu::consensus::PeerMessageQueue::AppendOperation()
> 0xb7c675 
> kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked()
> 0xb8c147 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
> {noformat}
> {{kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()}} needs to 
> take the lock to check the term and the Raft role. When many RPCs come in for 
> the same tablet, the contention can hog service threads and cause queue 
> overflows on busy systems.
> Yugabyte switched their equivalent lock to be an atomic that allows them to 
> read the term and role wait-free.
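
To illustrate the Yugabyte-style approach in Python terms (Kudu would use a
C++ atomic; this is a sketch of the idea, not a patch): writers publish an
immutable (term, role) snapshot, and the hot Write path reads it without
taking the consensus lock.

{code:python}
import threading

class ConsensusState:
    def __init__(self):
        self._snapshot = (0, 'FOLLOWER')      # replaced atomically as a whole
        self._write_lock = threading.Lock()   # only state *changes* serialize

    def set_term_and_role(self, term, role):
        with self._write_lock:
            self._snapshot = (term, role)

    def check_leadership_and_bind_term(self):
        # A single reference load; no lock for service threads to pile up on.
        term, role = self._snapshot
        return role == 'LEADER', term
{code}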



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2019-03-26 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802279#comment-16802279
 ] 

Mike Percy commented on KUDU-2727:
--

I'm going to look at this in my spare time.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2744) Add RPC support for diff scans

2019-03-26 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2744:
-
Status: In Review  (was: Open)

> Add RPC support for diff scans
> --
>
> Key: KUDU-2744
> URL: https://issues.apache.org/jira/browse/KUDU-2744
> Project: Kudu
>  Issue Type: Task
>  Components: backup
>Reporter: Mike Percy
>Assignee: Mike Percy
>Priority: Major
>
> Add RPC support for diff scans



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2744) Add RPC support for diff scans

2019-03-26 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2744:
-
   Resolution: Fixed
Fix Version/s: 1.10.0
   Status: Resolved  (was: In Review)

Merged as e8be768

> Add RPC support for diff scans
> --
>
> Key: KUDU-2744
> URL: https://issues.apache.org/jira/browse/KUDU-2744
> Project: Kudu
>  Issue Type: Task
>  Components: backup
>Reporter: Mike Percy
>Assignee: Mike Percy
>Priority: Major
> Fix For: 1.10.0
>
>
> Add RPC support for diff scans



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2744) Add RPC support for diff scans

2019-03-26 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2744:
-
Code Review: https://gerrit.cloudera.org/c/12592/

> Add RPC support for diff scans
> --
>
> Key: KUDU-2744
> URL: https://issues.apache.org/jira/browse/KUDU-2744
> Project: Kudu
>  Issue Type: Task
>  Components: backup
>Reporter: Mike Percy
>Assignee: Mike Percy
>Priority: Major
>
> Add RPC support for diff scans



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2744) Add RPC support for diff scans

2019-03-14 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2744:


 Summary: Add RPC support for diff scans
 Key: KUDU-2744
 URL: https://issues.apache.org/jira/browse/KUDU-2744
 Project: Kudu
  Issue Type: Task
  Components: backup
Reporter: Mike Percy
Assignee: Mike Percy


Add RPC support for diff scans



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2741) Failure in TestMergeIterator.TestDeDupGhostRows

2019-03-12 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy resolved KUDU-2741.
--
   Resolution: Fixed
Fix Version/s: 1.10.0

Fixed in d17e3ef345498777e32f2b275f952abac1369a7a

> Failure in TestMergeIterator.TestDeDupGhostRows
> ---
>
> Key: KUDU-2741
> URL: https://issues.apache.org/jira/browse/KUDU-2741
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Will Berkeley
>Assignee: Mike Percy
>Priority: Major
> Fix For: 1.10.0
>
>
> Test log of reproducible failure below:
> {noformat}
> $ bin/generic_iterators-test --gtest_filter="*DeDup*" 
> --gtest_random_seed=1615295598
> Note: Google Test filter = *DeDup*
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from TestMergeIterator
> [ RUN  ] TestMergeIterator.TestDeDupGhostRows
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0311 13:16:42.837129 199316928 test_util.cc:212] Using random seed: 
> 1078076534
> I0311 13:16:42.839583 199316928 generic_iterators-test.cc:317] Time spent 
> sorting the expected results: real 0.000s   user 0.000s sys 0.000s
> I0311 13:16:42.839709 199316928 generic_iterators-test.cc:321] Time spent 
> shuffling the inputs: real 0.000s   user 0.000s sys 0.000s
> I0311 13:16:42.839901 199316928 generic_iterators-test.cc:346] Predicate: val 
> >=  AND val < 
> ../../src/kudu/common/generic_iterators-test.cc:366: Failure
>   Expected: expected[total_idx]
>   Which is: 10264066
> To be equal to: row_val
>   Which is: 10282492
> Yielded out of order at idx 1823
> I0311 13:16:42.848778 199316928 generic_iterators-test.cc:348] Time spent 
> iterating merged lists: real 0.009s user 0.009s sys 0.000s
> ../../src/kudu/common/generic_iterators-test.cc:414: Failure
> Expected: TestMerge(kIntSchemaWithVCol, match_all_pred, true, true) doesn't 
> generate new fatal failures in the current thread.
>   Actual: it does.
> [  FAILED  ] TestMergeIterator.TestDeDupGhostRows (11 ms)
> [--] 1 test from TestMergeIterator (11 ms total)
> [--] Global test environment tear-down
> [==] 1 test from 1 test case ran. (12 ms total)
> [  PASSED  ] 0 tests.
> [  FAILED  ] 1 test, listed below:
> [  FAILED  ] TestMergeIterator.TestDeDupGhostRows
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KUDU-2741) Failure in TestMergeIterator.TestDeDupGhostRows

2019-03-11 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy reassigned KUDU-2741:


Assignee: Mike Percy




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2741) Failure in TestMergeIterator.TestDeDupGhostRows

2019-03-11 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16789932#comment-16789932
 ] 

Mike Percy commented on KUDU-2741:
--

Thanks for filing – I'm looking at this.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2645) Diff scanner should perform a merge on the rowset iterators at scan time

2019-03-07 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2645:
-
   Resolution: Fixed
Fix Version/s: 1.10.0
   Status: Resolved  (was: In Review)

Merged as 5953357

> Diff scanner should perform a merge on the rowset iterators at scan time
> 
>
> Key: KUDU-2645
> URL: https://issues.apache.org/jira/browse/KUDU-2645
> Project: Kudu
>  Issue Type: New Feature
>  Components: tablet
>Reporter: Mike Percy
>Assignee: Mike Percy
>Priority: Major
> Fix For: 1.10.0
>
>
> In order to perform a diff scan we will need the MergeIterator to ensure that 
> duplicate ghost rows are not returned in cases where a row was deleted and 
> flushed, then reinserted into a new rowset during the time period covered by 
> the diff scan. In such a case, only one representation of the row should be 
> returned, which is the reinserted one.
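
A toy illustration of the dedup rule (not the MergeIterator's actual logic):
when two rowset iterators yield the same key, keep only the live (reinserted)
representation instead of also returning the ghost.

{code:python}
import heapq

def merge_dedup_ghosts(rowset_iters):
    """Each input: sorted [(key, is_ghost)]. Yields one row per key,
    preferring the live (reinserted) representation over the ghost."""
    merged = heapq.merge(*rowset_iters)  # globally sorted by key
    prev = None
    for key, is_ghost in merged:
        if prev is not None and prev[0] == key:
            # Same key from two rowsets: a deleted-and-flushed ghost plus a
            # reinsertion. The row stays a ghost only if every copy is one.
            prev = (key, prev[1] and is_ghost)
            continue
        if prev is not None:
            yield prev
        prev = (key, is_ghost)
    if prev is not None:
        yield prev

# Row 2 was deleted and flushed (ghost), then reinserted in a newer rowset:
old = [(1, False), (2, True)]
new = [(2, False), (3, False)]
assert list(merge_dedup_ghosts([old, new])) == [(1, False), (2, False), (3, False)]
{code}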



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2740) TabletCopyITest.TestTabletCopyingDeletedTabletFails flaky due to lack of leader election retries

2019-03-07 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2740:


 Summary: TabletCopyITest.TestTabletCopyingDeletedTabletFails flaky 
due to lack of leader election retries
 Key: KUDU-2740
 URL: https://issues.apache.org/jira/browse/KUDU-2740
 Project: Kudu
  Issue Type: Bug
Affects Versions: 1.9.0
Reporter: Mike Percy


This test can be flaky because we disable failure detection and neglect to 
retry the leader election. An example error looks like this:
{code:java}
I0307 01:24:56.238428 5333 tablet_service.cc:1239] Received Run Leader Election 
RPC: tablet_id: "e75f819cfb0a45c483899e2396b3a07a"
dest_uuid: "89a95beac49a43d0b02b662e1a228337"
from {username='slave'} at 127.0.0.1:48832
I0307 01:24:56.238809 5333 raft_consensus.cc:472] T 
e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337 [term 0 
FOLLOWER]: Starting forced leader election (received explicit request)
I0307 01:24:56.238982 5333 raft_consensus.cc:2886] T 
e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337 [term 0 
FOLLOWER]: Advancing to term 1
W0307 01:24:56.255393 5500 tablet.cc:1786] T e75f819cfb0a45c483899e2396b3a07a P 
7509a1eca3f14f45903715fdb6a20f77: Can't schedule compaction. Clean time has not 
been advanced past its initial value.
W0307 01:24:56.261915 5377 tablet.cc:1786] T e75f819cfb0a45c483899e2396b3a07a P 
89a95beac49a43d0b02b662e1a228337: Can't schedule compaction. Clean time has not 
been advanced past its initial value.
W0307 01:24:56.291085 5622 tablet.cc:1786] T e75f819cfb0a45c483899e2396b3a07a P 
0caf13c7f5a64af781811ca30ab3656d: Can't schedule compaction. Clean time has not 
been advanced past its initial value.
W0307 01:24:58.477632 5333 consensus_meta.cc:220] T 
e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337: Time spent 
flushing consensus metadata: real 2.238s user 0.003s sys 0.000s
I0307 01:24:58.477829 5333 raft_consensus.cc:494] T 
e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337 [term 1 
FOLLOWER]: Starting forced leader election with config: opid_index: -1 
OBSOLETE_local: false peers { permanent_uuid: 
"89a95beac49a43d0b02b662e1a228337" member_type: VOTER last_known_addr { host: 
"127.4.141.65" port: 44695 } } peers { permanent_uuid: 
"0caf13c7f5a64af781811ca30ab3656d" member_type: VOTER last_known_addr { host: 
"127.4.141.67" port: 32845 } } peers { permanent_uuid: 
"7509a1eca3f14f45903715fdb6a20f77" member_type: VOTER last_known_addr { host: 
"127.4.141.66" port: 35595 } }
I0307 01:24:58.479737 5333 leader_election.cc:296] T 
e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337 
[CANDIDATE]: Term 1 election: Requested vote from peers 
0caf13c7f5a64af781811ca30ab3656d (127.4.141.67:32845), 
7509a1eca3f14f45903715fdb6a20f77 (127.4.141.66:35595)
I0307 01:24:58.480012 5333 rpcz_store.cc:269] Call 
kudu.consensus.ConsensusService.RunLeaderElection from 127.0.0.1:48832 (request 
call id 3) took 2241ms. Request Metrics: {"dns_us":93}
I0307 01:24:58.487798 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 
0.0 0.0 0.0
I0307 01:24:58.493844 5578 tablet_service.cc:1122] Received 
RequestConsensusVote() RPC: tablet_id: "e75f819cfb0a45c483899e2396b3a07a" 
candidate_uuid: "89a95beac49a43d0b02b662e1a228337" candidate_term: 1 
candidate_status { last_received { term: 0 index: 0 } } ignore_live_leader: 
true dest_uuid: "0caf13c7f5a64af781811ca30ab3656d"
I0307 01:24:58.494168 5578 raft_consensus.cc:2886] T 
e75f819cfb0a45c483899e2396b3a07a P 0caf13c7f5a64af781811ca30ab3656d [term 0 
FOLLOWER]: Advancing to term 1
I0307 01:24:58.494354 5456 tablet_service.cc:1122] Received 
RequestConsensusVote() RPC: tablet_id: "e75f819cfb0a45c483899e2396b3a07a" 
candidate_uuid: "89a95beac49a43d0b02b662e1a228337" candidate_term: 1 
candidate_status { last_received { term: 0 index: 0 } } ignore_live_leader: 
true dest_uuid: "7509a1eca3f14f45903715fdb6a20f77"
I0307 01:24:58.494655 5456 raft_consensus.cc:2886] T 
e75f819cfb0a45c483899e2396b3a07a P 7509a1eca3f14f45903715fdb6a20f77 [term 0 
FOLLOWER]: Advancing to term 1
W0307 01:24:59.988574 5267 leader_election.cc:341] T 
e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337 
[CANDIDATE]: Term 1 election: RPC error from VoteRequest() call to peer 
7509a1eca3f14f45903715fdb6a20f77 (127.4.141.66:35595): Timed out: 
RequestConsensusVote RPC to 127.4.141.66:35595 timed out after 1.507s (SENT)
W0307 01:24:59.988920 5266 leader_election.cc:341] T 
e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337 
[CANDIDATE]: Term 1 election: RPC error from VoteRequest() call to peer 
0caf13c7f5a64af781811ca30ab3656d (127.4.141.67:32845): Timed out: 
RequestConsensusVote RPC to 127.4.141.67:32845 timed out after 1.507s (SENT)
I0307 01:24:59.989068 5266 leader_election.cc:310] T 
e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337 
[CANDIDATE]: Term 1 election: Election decided. Result: candidate lost.

[jira] [Created] (KUDU-2738) linked_list-test occasionally fails with webserver port bind failure: address already in use

2019-03-07 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2738:


 Summary: linked_list-test occasionally fails with webserver port 
bind failure: address already in use
 Key: KUDU-2738
 URL: https://issues.apache.org/jira/browse/KUDU-2738
 Project: Kudu
  Issue Type: Bug
  Components: test
Affects Versions: 1.9.0
Reporter: Mike Percy


Occasionally I see linked_list-test fail with the following error on Linux in 
an automated test environment:
{code:java}
E0306 23:35:25.207222 19523 webserver.cc:369] Webserver: set_ports_option: 
cannot bind to 127.14.25.194:49008: 98 (Address already in use)
W0306 23:35:25.207244 19523 net_util.cc:457] Trying to use lsof to find any 
processes listening on 0.0.0.0:49008
I0306 23:35:25.207249 19523 net_util.cc:460] $ export PATH=$PATH:/usr/sbin ; 
lsof -n -i 'TCP:49008' -sTCP:LISTEN ; for pid in $(lsof -F p -n -i 'TCP:49008' 
-sTCP:LISTEN | grep p | cut -f 2 -dp) ; do while [ $pid -gt 1 ] ; do ps h -fp 
$pid ; stat=($(

[jira] [Updated] (KUDU-2738) linked_list-test occasionally fails with webserver port bind failure: address already in use

2019-03-07 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2738:
-
Attachment: jenkins_output.txt.gz

> linked_list-test occasionally fails with webserver port bind failure: address 
> already in use
> 
>
> Key: KUDU-2738
> URL: https://issues.apache.org/jira/browse/KUDU-2738
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.9.0
>Reporter: Mike Percy
>Priority: Trivial
> Attachments: jenkins_output.txt.gz
>
>
> Occasionally I see linked_list-test fail with the following error on Linux in 
> an automated test environment:
> {code:java}
> E0306 23:35:25.207222 19523 webserver.cc:369] Webserver: set_ports_option: 
> cannot bind to 127.14.25.194:49008: 98 (Address already in use)
> W0306 23:35:25.207244 19523 net_util.cc:457] Trying to use lsof to find any 
> processes listening on 0.0.0.0:49008
> I0306 23:35:25.207249 19523 net_util.cc:460] $ export PATH=$PATH:/usr/sbin ; 
> lsof -n -i 'TCP:49008' -sTCP:LISTEN ; for pid in $(lsof -F p -n -i 
> 'TCP:49008' -sTCP:LISTEN | grep p | cut -f 2 -dp) ; do while [ $pid -gt 1 ] ; 
> do ps h -fp $pid ; stat=($( ...
> W0306 23:35:25.583075 19523 net_util.cc:467]
> F0306 23:35:25.583206 19523 tablet_server_main.cc:89] Check failed: _s.ok() 
> Bad status: Runtime error: Webserver: could not start on address 
> 127.14.25.194:49008: set_ports_option: cannot bind to 127.14.25.194:49008: 98 
> (Address already in use){code}
> I am not sure what would have bound to 0.0.0.0:49008 for a short period of 
> time, or used 127.14.25.194:49008 as an ephemeral address / port pair since 
> it's such a unique loopback IP address.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2736) RemoteKsckTest.TestClusterWithLocation is flaky

2019-03-05 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2736:


 Summary: RemoteKsckTest.TestClusterWithLocation is flaky
 Key: KUDU-2736
 URL: https://issues.apache.org/jira/browse/KUDU-2736
 Project: Kudu
  Issue Type: Improvement
  Components: test
Affects Versions: 1.9.0
Reporter: Mike Percy


RemoteKsckTest.TestClusterWithLocation is flaky

Alexey took a look at it and here is the analysis:

In essence, due to the slowness of TSAN builds, connection negotiation from 
the kudu CLI to one of the master servers timed out, so one of the test's 
preconditions wasn't met.  The error output by the test was:
{code:java}
/data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/ksck_remote-test.cc:523:
 Failure
Failed                                                                          
Bad status: Network error: failed to gather info from all masters: 1 of 3 had 
errors
{code}
The corresponding error in the master's log was:
{code:java}
W0221 12:38:27.119146 31380 negotiation.cc:313] Failed RPC negotiation. Trace:  
0221 12:38:23.949428 (+     0us) reactor.cc:583] Submitting negotiation task 
for client connection to 127.25.42.190:51799
0221 12:38:25.362220 (+1412792us) negotiation.cc:98] Waiting for socket to 
connect
0221 12:38:25.363489 (+  1269us) client_negotiation.cc:167] Beginning 
negotiation
0221 12:38:25.369976 (+  6487us) client_negotiation.cc:244] Sending NEGOTIATE 
NegotiatePB request
0221 12:38:25.431582 (+ 61606us) client_negotiation.cc:261] Received NEGOTIATE 
NegotiatePB response
0221 12:38:25.431610 (+    28us) client_negotiation.cc:355] Received NEGOTIATE 
response from server
0221 12:38:25.432659 (+  1049us) client_negotiation.cc:182] Negotiated 
authn=SASL
0221 12:38:27.051125 (+1618466us) client_negotiation.cc:483] Received 
TLS_HANDSHAKE response from server
0221 12:38:27.062085 (+ 10960us) client_negotiation.cc:471] Sending 
TLS_HANDSHAKE message to server
0221 12:38:27.062132 (+    47us) client_negotiation.cc:244] Sending 
TLS_HANDSHAKE NegotiatePB request
0221 12:38:27.064391 (+  2259us) negotiation.cc:304] Negotiation complete: 
Timed out: Client connection negotiation failed: client connection to 
127.25.42.190:51799: BlockingWrite timed out
{code}
We are seeing this on the flaky test dashboard for both TSAN and ASAN builds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2735) RemoteKsckTest.TestClusterWithLocation is flaky

2019-03-05 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2735:


 Summary: RemoteKsckTest.TestClusterWithLocation is flaky
 Key: KUDU-2735
 URL: https://issues.apache.org/jira/browse/KUDU-2735
 Project: Kudu
  Issue Type: Improvement
  Components: test
Affects Versions: 1.9.0
Reporter: Mike Percy


RemoteKsckTest.TestClusterWithLocation is flaky

Alexey took a look at it and here is the analysis:

In essence, due to the slowness of TSAN builds, connection negotiation from the 
kudu CLI to one of the master servers timed out, so one of the preconditions of 
the test wasn't met. The error output by the test was:
{code:java}
/data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/ksck_remote-test.cc:523:
 Failure
Failed                                                                          
Bad status: Network error: failed to gather info from all masters: 1 of 3 had 
errors
{code}
The corresponding error in the master's log was:
{code:java}
W0221 12:38:27.119146 31380 negotiation.cc:313] Failed RPC negotiation. Trace:  
0221 12:38:23.949428 (+     0us) reactor.cc:583] Submitting negotiation task 
for client connection to 127.25.42.190:51799
0221 12:38:25.362220 (+1412792us) negotiation.cc:98] Waiting for socket to 
connect
0221 12:38:25.363489 (+  1269us) client_negotiation.cc:167] Beginning 
negotiation
0221 12:38:25.369976 (+  6487us) client_negotiation.cc:244] Sending NEGOTIATE 
NegotiatePB request
0221 12:38:25.431582 (+ 61606us) client_negotiation.cc:261] Received NEGOTIATE 
NegotiatePB response
0221 12:38:25.431610 (+    28us) client_negotiation.cc:355] Received NEGOTIATE 
response from server
0221 12:38:25.432659 (+  1049us) client_negotiation.cc:182] Negotiated 
authn=SASL
0221 12:38:27.051125 (+1618466us) client_negotiation.cc:483] Received 
TLS_HANDSHAKE response from server
0221 12:38:27.062085 (+ 10960us) client_negotiation.cc:471] Sending 
TLS_HANDSHAKE message to server
0221 12:38:27.062132 (+    47us) client_negotiation.cc:244] Sending 
TLS_HANDSHAKE NegotiatePB request
0221 12:38:27.064391 (+  2259us) negotiation.cc:304] Negotiation complete: 
Timed out: Client connection negotiation failed: client connection to 
127.25.42.190:51799: BlockingWrite timed out
{code}
We are seeing this on the flaky test dashboard for both TSAN and ASAN builds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2734) RemoteKsckTest.TestClusterWithLocation is flaky

2019-03-05 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2734:


 Summary: RemoteKsckTest.TestClusterWithLocation is flaky
 Key: KUDU-2734
 URL: https://issues.apache.org/jira/browse/KUDU-2734
 Project: Kudu
  Issue Type: Improvement
  Components: test
Affects Versions: 1.9.0
Reporter: Mike Percy


RemoteKsckTest.TestClusterWithLocation is flaky

Alexey took a look at it and here is the analysis:

In essence, due to the slowness of TSAN builds, connection negotiation from the 
kudu CLI to one of the master servers timed out, so one of the preconditions of 
the test wasn't met. The error output by the test was:
{code:java}
/data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/ksck_remote-test.cc:523:
 Failure
Failed                                                                          
Bad status: Network error: failed to gather info from all masters: 1 of 3 had 
errors
{code}
The corresponding error in the master's log was:
{code:java}
W0221 12:38:27.119146 31380 negotiation.cc:313] Failed RPC negotiation. Trace:  
0221 12:38:23.949428 (+     0us) reactor.cc:583] Submitting negotiation task 
for client connection to 127.25.42.190:51799
0221 12:38:25.362220 (+1412792us) negotiation.cc:98] Waiting for socket to 
connect
0221 12:38:25.363489 (+  1269us) client_negotiation.cc:167] Beginning 
negotiation
0221 12:38:25.369976 (+  6487us) client_negotiation.cc:244] Sending NEGOTIATE 
NegotiatePB request
0221 12:38:25.431582 (+ 61606us) client_negotiation.cc:261] Received NEGOTIATE 
NegotiatePB response
0221 12:38:25.431610 (+    28us) client_negotiation.cc:355] Received NEGOTIATE 
response from server
0221 12:38:25.432659 (+  1049us) client_negotiation.cc:182] Negotiated 
authn=SASL
0221 12:38:27.051125 (+1618466us) client_negotiation.cc:483] Received 
TLS_HANDSHAKE response from server
0221 12:38:27.062085 (+ 10960us) client_negotiation.cc:471] Sending 
TLS_HANDSHAKE message to server
0221 12:38:27.062132 (+    47us) client_negotiation.cc:244] Sending 
TLS_HANDSHAKE NegotiatePB request
0221 12:38:27.064391 (+  2259us) negotiation.cc:304] Negotiation complete: 
Timed out: Client connection negotiation failed: client connection to 
127.25.42.190:51799: BlockingWrite timed out
{code}
We are seeing this on the flaky test dashboard for both TSAN and ASAN builds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2390) ITClient fails with "Row count unexpectedly decreased"

2019-03-05 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785186#comment-16785186
 ] 

Mike Percy commented on KUDU-2390:
--

Another instance of this was observed; attaching the failure log. I haven't 
spent much time investigating this yet, but maybe the chaos thread restarting 
one of the tablet servers could have resulted in an under-count, since one of 
the tablet servers is restarted 2 seconds before we see the error in the log:

20:03:34.125 [INFO - Thread-5] (MiniKuduCluster.java:368) Killing tablet server 
127.1.121.66:48003
20:03:34.131 [INFO - Thread-5] (MiniKuduCluster.java:349) Starting tablet 
server 127.1.121.66:48003
20:03:36.094 [ERROR - Thread-7] (ITClient.java:135) Row count unexpectedly 
decreased from 87549 to 59949

> ITClient fails with "Row count unexpectedly decreased"
> --
>
> Key: KUDU-2390
> URL: https://issues.apache.org/jira/browse/KUDU-2390
> Project: Kudu
>  Issue Type: Bug
>  Components: java, test
>Affects Versions: 1.7.0, 1.8.0
>Reporter: Todd Lipcon
>Priority: Critical
> Attachments: Stdout.txt.gz, TEST-org.apache.kudu.client.ITClient.xml, 
> TEST-org.apache.kudu.client.ITClient.xml.gz, 
> TEST-org.apache.kudu.client.ITClient.xml.xz
>
>
> On master, hit the following failure of ITClient:
> {code}
> 20:05:05.407 [DEBUG - New I/O worker #17] (AsyncKuduScanner.java:934) 
> AsyncKuduScanner$Response(scannerId = "6ddf5d0da48241aea4b9eb51645716cc", 
> data = RowResultIterator for 27600 rows, more = true, responseScanTimestamp = 
> 6234957022375723008) for scanner
> 20:05:05.407 [DEBUG - New I/O worker #17] (AsyncKuduScanner.java:447) Scanner 
> "6ddf5d0da48241aea4b9eb51645716cc" opened on 
> d78cb5506f6e4e17bd54fdaf1819a8a2@[729d64003e7740cabb650f8f6aea4af6(127.1.76.194:60468),7a2e5f9b2be9497fadc30b81a6a50b24(127.1.76.19
> 20:05:05.409 [DEBUG - New I/O worker #17] (AsyncKuduScanner.java:934) 
> AsyncKuduScanner$Response(scannerId = "", data = RowResultIterator for 7314 
> rows, more = false) for scanner 
> KuduScanner(table=org.apache.kudu.client.ITClient-1522206255318, tablet=d78c
> 20:05:05.409 [INFO - Thread-4] (ITClient.java:397) New row count 90114
> 20:05:05.414 [DEBUG - New I/O worker #17] (AsyncKuduScanner.java:934) 
> AsyncKuduScanner$Response(scannerId = "c230614ad13e40478254b785995d1d7c", 
> data = RowResultIterator for 27600 rows, more = true, responseScanTimestamp = 
> 6234957022413987840) for scanner
> 20:05:05.414 [DEBUG - New I/O worker #17] (AsyncKuduScanner.java:447) Scanner 
> "c230614ad13e40478254b785995d1d7c" opened on 
> d78cb5506f6e4e17bd54fdaf1819a8a2@[729d64003e7740cabb650f8f6aea4af6(127.1.76.194:60468),7a2e5f9b2be9497fadc30b81a6a50b24(127.1.76.19
> 20:05:05.419 [DEBUG - New I/O worker #17] (AsyncKuduScanner.java:934) 
> AsyncKuduScanner$Response(scannerId = "", data = RowResultIterator for 27600 
> rows, more = true) for scanner 
> KuduScanner(table=org.apache.kudu.client.ITClient-1522206255318, tablet=d78c
> 20:05:05.420 [DEBUG - New I/O worker #17] (AsyncKuduScanner.java:934) 
> AsyncKuduScanner$Response(scannerId = "", data = RowResultIterator for 7342 
> rows, more = false) for scanner 
> KuduScanner(table=org.apache.kudu.client.ITClient-1522206255318, tablet=d78c
> 20:05:05.421 [ERROR - Thread-4] (ITClient.java:134) Row count unexpectedly 
> decreased from 90114to 62542
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2390) ITClient fails with "Row count unexpectedly decreased"

2019-03-05 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2390:
-
Attachment: TEST-org.apache.kudu.client.ITClient.xml.gz

> ITClient fails with "Row count unexpectedly decreased"
> --
>
> Key: KUDU-2390
> URL: https://issues.apache.org/jira/browse/KUDU-2390
> Project: Kudu
>  Issue Type: Bug
>  Components: java, test
>Affects Versions: 1.7.0, 1.8.0
>Reporter: Todd Lipcon
>Priority: Critical
> Attachments: Stdout.txt.gz, TEST-org.apache.kudu.client.ITClient.xml, 
> TEST-org.apache.kudu.client.ITClient.xml.gz, 
> TEST-org.apache.kudu.client.ITClient.xml.xz
>
>
> On master, hit the following failure of ITClient:
> {code}
> 20:05:05.407 [DEBUG - New I/O worker #17] (AsyncKuduScanner.java:934) 
> AsyncKuduScanner$Response(scannerId = "6ddf5d0da48241aea4b9eb51645716cc", 
> data = RowResultIterator for 27600 rows, more = true, responseScanTimestamp = 
> 6234957022375723008) for scanner
> 20:05:05.407 [DEBUG - New I/O worker #17] (AsyncKuduScanner.java:447) Scanner 
> "6ddf5d0da48241aea4b9eb51645716cc" opened on 
> d78cb5506f6e4e17bd54fdaf1819a8a2@[729d64003e7740cabb650f8f6aea4af6(127.1.76.194:60468),7a2e5f9b2be9497fadc30b81a6a50b24(127.1.76.19
> 20:05:05.409 [DEBUG - New I/O worker #17] (AsyncKuduScanner.java:934) 
> AsyncKuduScanner$Response(scannerId = "", data = RowResultIterator for 7314 
> rows, more = false) for scanner 
> KuduScanner(table=org.apache.kudu.client.ITClient-1522206255318, tablet=d78c
> 20:05:05.409 [INFO - Thread-4] (ITClient.java:397) New row count 90114
> 20:05:05.414 [DEBUG - New I/O worker #17] (AsyncKuduScanner.java:934) 
> AsyncKuduScanner$Response(scannerId = "c230614ad13e40478254b785995d1d7c", 
> data = RowResultIterator for 27600 rows, more = true, responseScanTimestamp = 
> 6234957022413987840) for scanner
> 20:05:05.414 [DEBUG - New I/O worker #17] (AsyncKuduScanner.java:447) Scanner 
> "c230614ad13e40478254b785995d1d7c" opened on 
> d78cb5506f6e4e17bd54fdaf1819a8a2@[729d64003e7740cabb650f8f6aea4af6(127.1.76.194:60468),7a2e5f9b2be9497fadc30b81a6a50b24(127.1.76.19
> 20:05:05.419 [DEBUG - New I/O worker #17] (AsyncKuduScanner.java:934) 
> AsyncKuduScanner$Response(scannerId = "", data = RowResultIterator for 27600 
> rows, more = true) for scanner 
> KuduScanner(table=org.apache.kudu.client.ITClient-1522206255318, tablet=d78c
> 20:05:05.420 [DEBUG - New I/O worker #17] (AsyncKuduScanner.java:934) 
> AsyncKuduScanner$Response(scannerId = "", data = RowResultIterator for 7342 
> rows, more = false) for scanner 
> KuduScanner(table=org.apache.kudu.client.ITClient-1522206255318, tablet=d78c
> 20:05:05.421 [ERROR - Thread-4] (ITClient.java:134) Row count unexpectedly 
> decreased from 90114to 62542
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2733) ITClient java test flaky: chaos thread failure: Couldn't restart a TS

2019-03-05 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2733:


 Summary: ITClient java test flaky: chaos thread failure: Couldn't 
restart a TS
 Key: KUDU-2733
 URL: https://issues.apache.org/jira/browse/KUDU-2733
 Project: Kudu
  Issue Type: Improvement
  Components: java, test
Affects Versions: 1.9.0
Reporter: Mike Percy
 Attachments: TEST-org.apache.kudu.client.ITClient.xml

Sometimes in ITClient.test(), the chaos thread cannot restart the tablet 
server. The error looks like this:
{code:java}
03:53:33.233 [ERROR - Thread-13] (ITClient.java:135) Couldn't restart a TS
java.lang.RuntimeException: Tablet server 127.26.66.66:38801 not found
at 
org.apache.kudu.test.cluster.MiniKuduCluster.getTabletServer(MiniKuduCluster.java:513)
at 
org.apache.kudu.test.cluster.MiniKuduCluster.killTabletServer(MiniKuduCluster.java:364)
at 
org.apache.kudu.test.KuduTestHarness.restartTabletServer(KuduTestHarness.java:285)
at org.apache.kudu.client.ITClient$ChaosThread.restartTS(ITClient.java:207)
at org.apache.kudu.client.ITClient$ChaosThread.run(ITClient.java:158)
at java.lang.Thread.run(Thread.java:745)
{code}
Attaching a test log.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2411) Create a public test utility artifact

2019-03-04 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy resolved KUDU-2411.
--
   Resolution: Fixed
Fix Version/s: 1.9.0

This capability made it into 1.9.0.
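
For downstream users, the resulting kudu-test-utils artifact exposes 
KuduTestHarness (the same class visible in stack traces elsewhere in this 
archive). A minimal sketch of the intended usage, assuming the JUnit 4 rule 
API it shipped with; the test class name and smoke call are illustrative only:
{code:java}
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.test.KuduTestHarness;
import org.junit.Rule;
import org.junit.Test;

public class ExampleKuduIntegrationTest {
  // Spins up an in-process mini cluster before each test and tears it down after.
  @Rule
  public final KuduTestHarness harness = new KuduTestHarness();

  @Test
  public void testClusterIsReachable() throws Exception {
    // The harness hands out a client already pointed at the mini cluster's masters.
    KuduClient client = harness.getClient();
    client.getTablesList();  // simple smoke call against the mini cluster
  }
}
{code}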

> Create a public test utility artifact
> -
>
> Key: KUDU-2411
> URL: https://issues.apache.org/jira/browse/KUDU-2411
> Project: Kudu
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.0
>Reporter: Grant Henke
>Assignee: Grant Henke
>Priority: Major
>  Labels: community
> Fix For: 1.9.0
>
>
> Create a publicly published test utility jar that contains useful testing 
> utilities for applications that integrate with Kudu, including things like 
> BaseKuduTest.java and MiniKuduCluster.java.
> This has the added benefit of eliminating the unusual dependency on the 
> kudu-client test artifact in each of the other Java modules. It could likely 
> be used in our examples code too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-1868) Java client mishandles socket read timeouts for scans

2019-03-04 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783805#comment-16783805
 ] 

Mike Percy commented on KUDU-1868:
--

Merged as part of these patches from Will:
 * [https://gerrit.cloudera.org/c/12338/]
 * [https://gerrit.cloudera.org/c/12363/]
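
For anyone still on an affected version, the workaround quoted in the 
description below (raising defaultSocketReadTimeoutMs to match 
defaultOperationTimeoutMs) would look roughly like this; a minimal sketch 
assuming the builder methods available in the affected clients, with a 
placeholder master address:
{code:java}
import org.apache.kudu.client.KuduClient;

public class ScanTimeoutWorkaround {
  public static void main(String[] args) throws Exception {
    // Placeholder master address; substitute your cluster's masters.
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051")
        .defaultOperationTimeoutMs(30000)    // default: 30 s
        .defaultSocketReadTimeoutMs(30000)   // default: 10 s; raise it to match
        .build();
    try {
      // ... run scans that may legitimately take longer than 10 s per round trip
    } finally {
      client.close();
    }
  }
}
{code}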

 

> Java client mishandles socket read timeouts for scans
> -
>
> Key: KUDU-1868
> URL: https://issues.apache.org/jira/browse/KUDU-1868
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 1.2.0
>Reporter: Jean-Daniel Cryans
>Assignee: Will Berkeley
>Priority: Major
>  Labels: backup
>
> Scan calls from the Java client that take more than the socket read timeout 
> get retried (unless the operation timeout has expired) instead of being 
> killed. Users will see this:
> {code}
> org.apache.kudu.client.NonRecoverableException: Invalid call sequence ID in 
> scan request
> {code}
> Note that the right behavior here would still end up killing the scanner, so 
> this is really a problem the user has to deal with! It's usually caused by 
> slow IO, combined with very selective scans.
> Workaround: set defaultSocketReadTimeoutMs higher, ideally equal to 
> defaultOperationTimeoutMs (the defaults are 10 and 30 seconds respectively). 
> But really the user should investigate why the individual scans are so slow.
> One potentially easy fix is to handle retries differently for scanners so 
> that the user gets a nicer exception. A harder fix is to handle socket read 
> timeouts completely differently: they should be per-RPC and not per 
> TabletClient as they are right now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-1868) Java client mishandles socket read timeouts for scans

2019-03-04 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy resolved KUDU-1868.
--
   Resolution: Fixed
Fix Version/s: 1.9.0

> Java client mishandles socket read timeouts for scans
> -
>
> Key: KUDU-1868
> URL: https://issues.apache.org/jira/browse/KUDU-1868
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 1.2.0
>Reporter: Jean-Daniel Cryans
>Assignee: Will Berkeley
>Priority: Major
>  Labels: backup
> Fix For: 1.9.0
>
>
> Scan calls from the Java client that take more than the socket read timeout 
> get retried (unless the operation timeout has expired) instead of being 
> killed. Users will see this:
> {code}
> org.apache.kudu.client.NonRecoverableException: Invalid call sequence ID in 
> scan request
> {code}
> Note that the right behavior here would still end up killing the scanner, so 
> this is really a problem the user has to deal with! It's usually caused by 
> slow IO, combined with very selective scans.
> Workaround: set defaultSocketReadTimeoutMs higher, ideally equal to 
> defaultOperationTimeoutMs (the defaults are 10 and 30 seconds respectively). 
> But really the user should investigate why the individual scans are so slow.
> One potentially easy fix is to handle retries differently for scanners so 
> that the user gets a nicer exception. A harder fix is to handle socket read 
> timeouts completely differently: they should be per-RPC and not per 
> TabletClient as they are right now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2724) Binary jar build on OSX should specify target macos version

2019-03-02 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2724:


 Summary: Binary jar build on OSX should specify target macos 
version
 Key: KUDU-2724
 URL: https://issues.apache.org/jira/browse/KUDU-2724
 Project: Kudu
  Issue Type: Improvement
Reporter: Mike Percy


The binary test jar build should use one of the commonly-used options to 
specify a target macOS version when building the binary jar, so that building 
on an old platform isn't required to get wide compatibility.

The common methods seem to be documented here:

[https://cmake.org/cmake/help/v3.0/variable/CMAKE_OSX_DEPLOYMENT_TARGET.html]

These include specifying the compiler flag -mmacosx-version-min, the 
environment variable MACOSX_DEPLOYMENT_TARGET, or the CMake variable 
CMAKE_OSX_DEPLOYMENT_TARGET.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2696) libgmock is linked into the kudu cli binary

2019-02-10 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2696:


 Summary: libgmock is linked into the kudu cli binary
 Key: KUDU-2696
 URL: https://issues.apache.org/jira/browse/KUDU-2696
 Project: Kudu
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Mike Percy


libgmock is linked into the kudu cli binary, even though we consider it a 
test-only dependency. Possibly a configuration problem in our cmake files?
{code:java}
$ ldd build/dynclang/bin/kudu | grep mock
 libgmock.so => 
/home/mpercy/src/kudu/thirdparty/installed/uninstrumented/lib/libgmock.so 
(0x7f01f1495000)
{code}
The gmock dependency does not appear in the server binaries, as expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2694) DeleteTabletITest.TestLeaderElectionDuringDeleteTablet is flaky

2019-02-09 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2694:


 Summary: DeleteTabletITest.TestLeaderElectionDuringDeleteTablet is 
flaky
 Key: KUDU-2694
 URL: https://issues.apache.org/jira/browse/KUDU-2694
 Project: Kudu
  Issue Type: Bug
  Components: consensus
Reporter: Mike Percy
 Attachments: delete_tablet-itest.txt.gz

DeleteTabletITest.TestLeaderElectionDuringDeleteTablet is slightly flaky, 
reporting bad health from the leader in some cases. Attaching a log file from a 
dist-test flaky-test job run.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2645) Diff scanner should perform a merge on the rowset iterators at scan time

2019-02-08 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2645:
-
Status: In Review  (was: In Progress)

> Diff scanner should perform a merge on the rowset iterators at scan time
> 
>
> Key: KUDU-2645
> URL: https://issues.apache.org/jira/browse/KUDU-2645
> Project: Kudu
>  Issue Type: New Feature
>  Components: tablet
>Reporter: Mike Percy
>Assignee: Mike Percy
>Priority: Major
>
> In order to perform a diff scan we will need the MergeIterator to ensure that 
> duplicate ghost rows are not returned in cases where a row was deleted and 
> flushed, then reinserted into a new rowset during the time period covered by 
> the diff scan. In such a case, only one representation of the row should be 
> returned, which is the reinserted one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2645) Diff scanner should perform a merge on the rowset iterators at scan time

2019-02-08 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2645:
-
Code Review: https://gerrit.cloudera.org/c/12205/

> Diff scanner should perform a merge on the rowset iterators at scan time
> 
>
> Key: KUDU-2645
> URL: https://issues.apache.org/jira/browse/KUDU-2645
> Project: Kudu
>  Issue Type: New Feature
>  Components: tablet
>Reporter: Mike Percy
>Assignee: Mike Percy
>Priority: Major
>
> In order to perform a diff scan we will need the MergeIterator to ensure that 
> duplicate ghost rows are not returned in cases where a row was deleted and 
> flushed, then reinserted into a new rowset during the time period covered by 
> the diff scan. In such a case, only one representation of the row should be 
> returned, which is the reinserted one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2693) Buffer DiskRowSet flushes to more efficiently write many columns

2019-02-08 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2693:


 Summary: Buffer DiskRowSet flushes to more efficiently write many 
columns
 Key: KUDU-2693
 URL: https://issues.apache.org/jira/browse/KUDU-2693
 Project: Kudu
  Issue Type: Improvement
  Components: fs, tablet
Affects Versions: 1.9.0
Reporter: Mike Percy


When looking at a trace of some MRS flushes on a table with 280 columns, it was 
observed that during the course of the flush some 695 fdatasync() calls 
occurred.

One possible way to minimize the number of fsync calls would be to flush 
directly to memory buffers first, determine the ideal layout on disk for the 
flushed blocks (possibly striped across one log block container per data disk) 
and then potentially write the data out to the containers in parallel. This 
would require some memory buffer space to be reserved per maintenance manager 
thread, possibly 64MB since the DRS roll size is 32MB.

According to Todd we could probably do it all in LogBlockManager by adding a 
new flag to CreateBlockOptions that says whether to buffer or something like 
that.
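
To make the idea concrete, here is a toy sketch (not the block manager API) of 
the buffering approach: accumulate each column block in memory, then write all 
of them into a single container file and pay for one sync instead of one per 
block. The container path and block contents are illustrative only:
{code:java}
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class BufferedFlushSketch {
  public static void main(String[] args) throws Exception {
    // One in-memory buffer per column block, instead of one synced write each.
    List<ByteArrayOutputStream> columnBlocks = new ArrayList<>();
    for (int col = 0; col < 280; col++) {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      buf.write(("column-" + col + "-data").getBytes(StandardCharsets.UTF_8));
      columnBlocks.add(buf);
    }
    // Write every buffered block into a single "container" file, then issue
    // exactly one data sync for the whole flush rather than hundreds.
    Path container = Paths.get("container-0.data");  // hypothetical container
    try (FileChannel ch = FileChannel.open(container,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      for (ByteArrayOutputStream buf : columnBlocks) {
        ch.write(ByteBuffer.wrap(buf.toByteArray()));
      }
      ch.force(false);  // one fdatasync-equivalent for the whole flush
    }
  }
}
{code}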



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2692) Remove requirements for virtual columns to specify a read default and not be nullable

2019-02-08 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2692:


 Summary: Remove requirements for virtual columns to specify a read 
default and not be nullable
 Key: KUDU-2692
 URL: https://issues.apache.org/jira/browse/KUDU-2692
 Project: Kudu
  Issue Type: Improvement
  Components: tablet
Reporter: Mike Percy


Virtual column types such as IS_DELETED currently require a read default to be 
specified, in addition to not being allowed to be nullable. Consider relaxing 
these requirements to improve the user experience when working with virtual 
columns.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2691) AlterTable transactions should anchor their ops in the WAL

2019-02-08 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2691:
-
Component/s: consensus

> AlterTable transactions should anchor their ops in the WAL
> --
>
> Key: KUDU-2691
> URL: https://issues.apache.org/jira/browse/KUDU-2691
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, log, tablet
>Affects Versions: 1.9.0
>Reporter: Mike Percy
>Priority: Major
>
> AlterTable does not appear to anchor its WAL ops, meaning there is nothing 
> preventing Kudu from GCing a WAL segment including an AlterTable that is 
> running very slowly for some reason. If that happens and then the tserver is 
> killed, it's possible for that replica to fail to start back up later. We 
> should anchor alter ops in the same way we anchor write operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2691) AlterTable transactions should anchor their ops in the WAL

2019-02-08 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2691:
-
Component/s: tablet

> AlterTable transactions should anchor their ops in the WAL
> --
>
> Key: KUDU-2691
> URL: https://issues.apache.org/jira/browse/KUDU-2691
> Project: Kudu
>  Issue Type: Bug
>  Components: log, tablet
>Affects Versions: 1.9.0
>Reporter: Mike Percy
>Priority: Major
>
> AlterTable does not appear to anchor its WAL ops, meaning there is nothing 
> preventing Kudu from GCing a WAL segment including an AlterTable that is 
> running very slowly for some reason. If that happens and then the tserver is 
> killed, it's possible for that replica to fail to start back up later. We 
> should anchor alter ops in the same way we anchor write operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2691) AlterTable transactions should anchor their ops in the WAL

2019-02-08 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2691:


 Summary: AlterTable transactions should anchor their ops in the WAL
 Key: KUDU-2691
 URL: https://issues.apache.org/jira/browse/KUDU-2691
 Project: Kudu
  Issue Type: Bug
  Components: log
Affects Versions: 1.9.0
Reporter: Mike Percy


AlterTable does not appear to anchor its WAL ops, meaning there is nothing 
preventing Kudu from GCing a WAL segment including an AlterTable that is 
running very slowly for some reason. If that happens and then the tserver is 
killed, it's possible for that replica to fail to start back up later. We 
should anchor alter ops in the same way we anchor write operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2676) Restore: Support creating tables with greater than the maximum allowed number of partitions

2019-02-06 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2676:
-
Summary: Restore: Support creating tables with greater than the maximum 
allowed number of partitions  (was: [Backup] Support restoring tables over the 
maximum allowed replicas)

> Restore: Support creating tables with greater than the maximum allowed number 
> of partitions
> ---
>
> Key: KUDU-2676
> URL: https://issues.apache.org/jira/browse/KUDU-2676
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.8.0
>Reporter: Grant Henke
>Priority: Major
>  Labels: backup
>
> Currently it is possible to back up a table that has more partitions than 
> are allowed at create time.
> This results in the restore job failing with the following exception:
> {noformat}
> 19/01/24 08:17:14 INFO backup.KuduRestore$: Restoring from path: 
> hdfs:///user/ghenke/kudu-backup-tests/20190124-080741
> Exception in thread "main" org.apache.kudu.client.NonRecoverableException: 
> the requested number of tablet replicas is over the maximum permitted at 
> creation time (
> 450), additional tablets may be added by adding range partitions to the table 
> post-creation
> at 
> org.apache.kudu.client.KuduException.transformException(KuduException.java:110)
> at 
> org.apache.kudu.client.KuduClient.joinAndHandleException(KuduClient.java:365)
> at org.apache.kudu.client.KuduClient.createTable(KuduClient.java:109)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2665) BlockManagerStressTest.StressTest is extremely flaky

2019-01-22 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2665:


 Summary: BlockManagerStressTest.StressTest is extremely flaky
 Key: KUDU-2665
 URL: https://issues.apache.org/jira/browse/KUDU-2665
 Project: Kudu
  Issue Type: New Feature
  Components: fs
Reporter: Mike Percy


After some recent block manager changes the Block Manager Stress Test is about 
50% flaky on certain precommit builds. The failure looks like this:
{code:java}
/data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/fs/block_manager-stress-test.cc:518:
 Failure
Failed
Bad status: Not found: 
/data/somelongdirectorytoavoidrpathissues/src/kudutest/block_manager-stress-test.0.BlockManagerStressTest_1.StressTest.1547778831841692-23619/data/e8ab31ef3e2143a5bc6d7a2b40e7805b.data:
 No such file or directory (error 2)
/data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/fs/block_manager-stress-test.cc:549:
 Failure
Expected: this->InjectNonFatalInconsistencies() doesn't generate new fatal 
failures in the current thread.
 Actual: it does.
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2195) Enforce durability happened before relationships on multiple disks

2019-01-08 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737707#comment-16737707
 ] 

Mike Percy commented on KUDU-2195:
--

Here is the aforementioned band-aid patch for review: 
https://gerrit.cloudera.org/c/12186/

> Enforce durability happened before relationships on multiple disks
> --
>
> Key: KUDU-2195
> URL: https://issues.apache.org/jira/browse/KUDU-2195
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, tablet
>Reporter: David Alves
>Priority: Major
>
> When using weaker durability semantics (e.g. when log_force_fsync is off) we 
> should still enforce certain happened-before relationships which are not 
> currently being enforced when using different disks for the wal and data.
> The two cases that come to mind where this is relevant are:
> 1) cmeta (c) -> wal (w): We flush cmeta before flushing the wal (for 
> instance on term change) with the intention that either {}, {c}, or {c, w} 
> was made durable.
> 2) wal (w) -> tablet meta (t): We flush the wal before tablet metadata to 
> make sure that all commit messages that refer to on-disk row sets (and 
> deltas) are on disk before the row sets they point to, i.e. with the 
> intention that either {}, {w}, or {w, t} was made durable.
> With strong durability semantics these are always made durable in the right 
> order. With weaker semantics that is not the case. If using the same disk 
> for both the wal and data then the invariants are still preserved, as 
> buffers will be flushed in the right order; but if using different disks for 
> the wal and data (and because cmeta is stored with the data) that is not 
> always the case.
> Case 1) is actually safe on ext4, because we perform an indirect fsync 
> (rename() implies fsync on ext4) when flushing cmeta. But it is not safe on 
> XFS.
> Case 2) is not safe on either filesystem.
> --- Possible solutions ---
> For 1): Store cmeta with the wal; or actually always fsync cmeta.
> For 2): Store tablet meta with the wal; or always fsync the wal before 
> flushing tablet meta.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2195) Enforce durability happened before relationships on multiple disks

2019-01-08 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737688#comment-16737688
 ] 

Mike Percy commented on KUDU-2195:
--

This was recently seen in the wild again. Usually it's people running XFS who 
experience a power outage and then see 0-length cmeta files. We should consider 
adding a gflag to always fsync just the cmeta files so that people running on 
XFS have a band-aid.
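
For illustration, the standard sequence that avoids the 0-length-file failure 
mode is write-temp / fsync / rename / fsync-dir. A minimal sketch in Java (not 
Kudu's actual cmeta code path, which is C++); opening a directory channel for 
the final sync works on POSIX systems:
{code:java}
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class DurableMetadataWrite {
  static void writeDurably(Path dest, byte[] contents) throws Exception {
    Path tmp = dest.resolveSibling(dest.getFileName() + ".tmp");
    try (FileChannel ch = FileChannel.open(tmp,
        StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING,
        StandardOpenOption.WRITE)) {
      ch.write(ByteBuffer.wrap(contents));
      ch.force(true);  // data must hit disk before the rename, XFS included
    }
    // rename(2) atomically replaces the destination on POSIX filesystems.
    Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
    // fsync the parent directory so the rename itself is durable.
    try (FileChannel dir = FileChannel.open(dest.getParent(),
        StandardOpenOption.READ)) {
      dir.force(true);
    }
  }
}
{code}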

> Enforce durability happened before relationships on multiple disks
> --
>
> Key: KUDU-2195
> URL: https://issues.apache.org/jira/browse/KUDU-2195
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, tablet
>Reporter: David Alves
>Priority: Major
>
> When using weaker durability semantics (e.g. when log_force_fsync is off) we 
> should still enforce certain happened-before relationships which are not 
> currently being enforced when using different disks for the wal and data.
> The two cases that come to mind where this is relevant are:
> 1) cmeta (c) -> wal (w): We flush cmeta before flushing the wal (for 
> instance on term change) with the intention that either {}, {c}, or {c, w} 
> was made durable.
> 2) wal (w) -> tablet meta (t): We flush the wal before tablet metadata to 
> make sure that all commit messages that refer to on-disk row sets (and 
> deltas) are on disk before the row sets they point to, i.e. with the 
> intention that either {}, {w}, or {w, t} was made durable.
> With strong durability semantics these are always made durable in the right 
> order. With weaker semantics that is not the case. If using the same disk 
> for both the wal and data then the invariants are still preserved, as 
> buffers will be flushed in the right order; but if using different disks for 
> the wal and data (and because cmeta is stored with the data) that is not 
> always the case.
> Case 1) is actually safe on ext4, because we perform an indirect fsync 
> (rename() implies fsync on ext4) when flushing cmeta. But it is not safe on 
> XFS.
> Case 2) is not safe on either filesystem.
> --- Possible solutions ---
> For 1): Store cmeta with the wal; or actually always fsync cmeta.
> For 2): Store tablet meta with the wal; or always fsync the wal before 
> flushing tablet meta.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2652) TsRecoveryITest.TestNoBlockIDReuseIfMissingBlocks potentially flaky

2019-01-03 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2652:


 Summary: TsRecoveryITest.TestNoBlockIDReuseIfMissingBlocks 
potentially flaky
 Key: KUDU-2652
 URL: https://issues.apache.org/jira/browse/KUDU-2652
 Project: Kudu
  Issue Type: New Feature
Reporter: Mike Percy
 Attachments: ts_recovery-itest.txt.gz

This test failed for me in a Gerrit pre-commit run with an unrelated change @ 
[http://jenkins.kudu.apache.org/job/kudu-gerrit/15885]

The error was:
{code:java}
/home/jenkins-slave/workspace/kudu-master/3/src/kudu/integration-tests/ts_recovery-itest.cc:298:
 Failure
Value of: !orphaned_block_ids.empty()
 Actual: false
Expected: true
/home/jenkins-slave/workspace/kudu-master/3/src/kudu/util/test_util.cc:323: 
Failure
Failed
Timed out waiting for assertion to pass.
{code}
I am attaching the error log.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2645) Diff scanner should perform a merge on the rowset iterators at scan time

2018-12-18 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2645:


 Summary: Diff scanner should perform a merge on the rowset 
iterators at scan time
 Key: KUDU-2645
 URL: https://issues.apache.org/jira/browse/KUDU-2645
 Project: Kudu
  Issue Type: New Feature
  Components: tablet
Reporter: Mike Percy


In order to perform a diff scan we will need the MergeIterator to ensure that 
duplicate ghost rows are not returned in cases where a row was deleted and 
flushed, then reinserted into a new rowset during the time period covered by 
the diff scan. In such a case, only one representation of the row should be 
returned, which is the reinserted one.
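
As a toy illustration of the guarantee the MergeIterator must provide (not the 
actual implementation), here is a two-way merge over key-sorted rowset streams 
that drops the older ghost when the same key reappears in a newer rowset:
{code:java}
import java.util.ArrayList;
import java.util.List;

public class GhostRowMergeSketch {
  // A simplified "row version": key, value, and whether this rowset's last
  // word on the row was a delete (a ghost). Requires Java 16+ for records.
  record RowVersion(int key, String value, boolean ghost) {}

  // Merge two per-rowset streams sorted by key. When the same key appears in
  // both (a flushed ghost plus a reinsertion in a newer rowset), emit only
  // the version from the newer rowset -- the reinserted one.
  static List<RowVersion> merge(List<RowVersion> older, List<RowVersion> newer) {
    List<RowVersion> out = new ArrayList<>();
    int i = 0, j = 0;
    while (i < older.size() || j < newer.size()) {
      if (i >= older.size()) { out.add(newer.get(j++)); }
      else if (j >= newer.size()) { out.add(older.get(i++)); }
      else if (older.get(i).key() < newer.get(j).key()) { out.add(older.get(i++)); }
      else if (older.get(i).key() > newer.get(j).key()) { out.add(newer.get(j++)); }
      else { out.add(newer.get(j++)); i++; }  // duplicate key: keep the newer row
    }
    return out;
  }
}
{code}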



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KUDU-2645) Diff scanner should perform a merge on the rowset iterators at scan time

2018-12-18 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy reassigned KUDU-2645:


Assignee: Mike Percy

> Diff scanner should perform a merge on the rowset iterators at scan time
> 
>
> Key: KUDU-2645
> URL: https://issues.apache.org/jira/browse/KUDU-2645
> Project: Kudu
>  Issue Type: New Feature
>  Components: tablet
>Reporter: Mike Percy
>Assignee: Mike Percy
>Priority: Major
>
> In order to perform a diff scan we will need the MergeIterator to ensure that 
> duplicate ghost rows are not returned in cases where a row was deleted and 
> flushed, then reinserted into a new rowset during the time period covered by 
> the diff scan. In such a case, only one representation of the row should be 
> returned, which is the reinserted one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-1563) Add support for INSERT IGNORE

2018-12-17 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723538#comment-16723538
 ] 

Mike Percy commented on KUDU-1563:
--

I know I'm late to this party. I think it's worth modeling what SQL does: in 
that context, INSERT IGNORE operates at a batch or operation level, not a 
session level. So it seems like a better impedance match to keep this type of 
error-handling configuration at the operation or batch level from a client API 
perspective, to avoid requiring SQL clients that cache sessions to constantly 
set session options.
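
For contrast, the session-level knob that exists today in the Java client 
(linked in the description below) looks like this; a minimal sketch with a 
placeholder master address, table name, and column:
{code:java}
import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;

public class SessionLevelIgnore {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    KuduTable table = client.openTable("my_table");  // placeholder table
    KuduSession session = client.newSession();
    // The session-level setting: duplicate-key errors are filtered client-side
    // for every operation applied through this session, not per batch or op.
    session.setIgnoreAllDuplicateRows(true);
    Insert insert = table.newInsert();
    insert.getRow().addInt("key", 1);  // assumes an INT32 column named "key"
    session.apply(insert);
    session.close();
    client.close();
  }
}
{code}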

> Add support for INSERT IGNORE
> -
>
> Key: KUDU-1563
> URL: https://issues.apache.org/jira/browse/KUDU-1563
> Project: Kudu
>  Issue Type: New Feature
>Reporter: Dan Burkert
>Assignee: Brock Noland
>Priority: Major
>  Labels: newbie
>
> The Java client currently has an [option to ignore duplicate row key errors| 
> https://kudu.apache.org/apidocs/org/kududb/client/AsyncKuduSession.html#setIgnoreAllDuplicateRows-boolean-],
>  which is implemented by filtering the errors on the client side.  If we are 
> going to continue to support this feature (and the consensus seems to be that 
> we probably should), we should promote it to a first class operation type 
> that is handled on the server side.  This would have a modest perf. 
> improvement since less errors are returned, and it would allow INSERT IGNORE 
> ops to be mixed in the same batch as other INSERT, DELETE, UPSERT, etc. ops.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-1575) Backup and restore procedures

2018-12-13 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720484#comment-16720484
 ] 

Mike Percy commented on KUDU-1575:
--

Hey Tim, the latest progress on this is that we have some of the low-level work 
done but are still finishing up the ability to do diff scans, which are the 
basis for incremental backups. Once we finish that, there is quite a bit of 
work left to implement restore of incremental backups, plus a lot of testing to 
ensure perf / scale / stability are all acceptable. No commitment on a 
timeline, but I am hoping a basic version of backup makes it out in the next 
release or two of Kudu.

> Backup and restore procedures
> -
>
> Key: KUDU-1575
> URL: https://issues.apache.org/jira/browse/KUDU-1575
> Project: Kudu
>  Issue Type: Improvement
>  Components: master, tserver
>Reporter: Mike Percy
>Assignee: Mike Percy
>Priority: Major
>
> Kudu needs backup and restore procedures, both for data and for metadata.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2629) TestHybridTime is flaky

2018-11-29 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703876#comment-16703876
 ] 

Mike Percy commented on KUDU-2629:
--

Saw the same error in a test run today. The error message was:
{code:java}
java.lang.AssertionError: expected:<4> but was:<3>
 at org.junit.Assert.fail(Assert.java:88)
 at org.junit.Assert.failNotEquals(Assert.java:834)
 at org.junit.Assert.assertEquals(Assert.java:645)
 at org.junit.Assert.assertEquals(Assert.java:631)
 at org.apache.kudu.client.TestHybridTime.test(TestHybridTime.java:167)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:483)
 at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
 at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
 at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
 at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
 at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
 at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.lang.Thread.run(Thread.java:745){code}

> TestHybridTime is flaky
> ---
>
> Key: KUDU-2629
> URL: https://issues.apache.org/jira/browse/KUDU-2629
> Project: Kudu
>  Issue Type: Bug
>  Components: java, test
>Reporter: Andrew Wong
>Priority: Major
> Attachments: TEST-org.apache.kudu.client.TestHybridTime.xml
>
>
> I saw three back-to-back failures of TestHybridTime in which a scan returned 
> an unexpected number of rows. I've attached the XML for the test and its 
> retries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KUDU-2402) Kudu Gerrit Sign-in link broken with Gerrit New UI

2018-11-08 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy reassigned KUDU-2402:


Assignee: Mike Percy

> Kudu Gerrit Sign-in link broken with Gerrit New UI
> --
>
> Key: KUDU-2402
> URL: https://issues.apache.org/jira/browse/KUDU-2402
> Project: Kudu
>  Issue Type: Bug
>  Components: project-infra
>Reporter: Mike Percy
>Assignee: Mike Percy
>Priority: Major
>
> Not sure if we need to upgrade the gerrit github plugin or what. The Sign In 
> link is broken after switching to the New UI in Gerrit. The URL I get is: 
> [https://gerrit.cloudera.org/login/%2Fq%2Fstatus%3Aopen] and that leads to a 
> 404 error.
> Sign-in seems to work fine after switching back to the "Old UI" in Gerrit.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2402) Kudu Gerrit Sign-in link broken with Gerrit New UI

2018-11-08 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy resolved KUDU-2402.
--
   Resolution: Fixed
Fix Version/s: n/a

> Kudu Gerrit Sign-in link broken with Gerrit New UI
> --
>
> Key: KUDU-2402
> URL: https://issues.apache.org/jira/browse/KUDU-2402
> Project: Kudu
>  Issue Type: Bug
>  Components: project-infra
>Reporter: Mike Percy
>Assignee: Mike Percy
>Priority: Major
> Fix For: n/a
>
>
> Not sure if we need to upgrade the gerrit github plugin or what. The Sign In 
> link is broken after switching to the New UI in Gerrit. The URL I get is: 
> [https://gerrit.cloudera.org/login/%2Fq%2Fstatus%3Aopen] and that leads to a 
> 404 error.
> Sign-in seems to work fine after switching back to the "Old UI" in Gerrit.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2402) Kudu Gerrit Sign-in link broken with Gerrit New UI

2018-11-08 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680230#comment-16680230
 ] 

Mike Percy commented on KUDU-2402:
--

After a lot of time sunk into this, I finally figured out (with the help of 
Will Wilson from Cloudera IT) that the root cause was an HTTP -> HTTPS redirect 
problem. I eventually got to the bottom of the Apache httpd configuration that 
needed to be changed. This is now fixed.

> Kudu Gerrit Sign-in link broken with Gerrit New UI
> --
>
> Key: KUDU-2402
> URL: https://issues.apache.org/jira/browse/KUDU-2402
> Project: Kudu
>  Issue Type: Bug
>  Components: project-infra
>Reporter: Mike Percy
>Priority: Major
>
> Not sure if we need to upgrade the gerrit github plugin or what. The Sign In 
> link is broken after switching to the New UI in Gerrit. The URL I get is: 
> [https://gerrit.cloudera.org/login/%2Fq%2Fstatus%3Aopen] and that leads to a 
> 404 error.
> Sign-in seems to work fine after switching back to the "Old UI" in Gerrit.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2614) Implement asynchronous replication

2018-10-24 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2614:


 Summary: Implement asynchronous replication
 Key: KUDU-2614
 URL: https://issues.apache.org/jira/browse/KUDU-2614
 Project: Kudu
  Issue Type: Task
Reporter: Mike Percy


Implement asynchronous cluster-to-cluster replication (across WAN links) for 
Kudu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2612) Implement multi-row transactions

2018-10-24 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2612:


 Summary: Implement multi-row transactions
 Key: KUDU-2612
 URL: https://issues.apache.org/jira/browse/KUDU-2612
 Project: Kudu
  Issue Type: Task
Reporter: Mike Percy


Tracking Jira to implement multi-row / multi-table transactions in Kudu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2613) Implement secondary indexes

2018-10-24 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2613:


 Summary: Implement secondary indexes
 Key: KUDU-2613
 URL: https://issues.apache.org/jira/browse/KUDU-2613
 Project: Kudu
  Issue Type: Task
Reporter: Mike Percy


Tracking Jira to implement secondary indexes in Kudu



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2402) Kudu Gerrit Sign-in link broken with Gerrit New UI

2018-10-02 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636249#comment-16636249
 ] 

Mike Percy commented on KUDU-2402:
--

It looks like Chromium Issue #8373 was fixed in 2.14.7 according to the release 
notes @ [https://www.gerritcodereview.com/2.14.html#2147] so that should be the 
minimum version we upgrade to.

> Kudu Gerrit Sign-in link broken with Gerrit New UI
> --
>
> Key: KUDU-2402
> URL: https://issues.apache.org/jira/browse/KUDU-2402
> Project: Kudu
>  Issue Type: Bug
>  Components: project-infra
>Reporter: Mike Percy
>Priority: Major
>
> Not sure if we need to upgrade the gerrit github plugin or what. The Sign In 
> link is broken after switching to the New UI in Gerrit. The URL I get is: 
> [https://gerrit.cloudera.org/login/%2Fq%2Fstatus%3Aopen] and that leads to a 
> 404 error.
> Sign-in seems to work fine after switching back to the "Old UI" in Gerrit.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-1521) Flakiness in TestAsyncKuduSession

2018-09-25 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16627870#comment-16627870
 ] 

Mike Percy commented on KUDU-1521:
--

I also observed a case where the test failed at the point where it expected 
the PleaseThrottleException, but the exception never appeared:
{code:java}
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:86)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at org.junit.Assert.assertTrue(Assert.java:52)
at 
org.apache.kudu.client.TestAsyncKuduSession.test(TestAsyncKuduSession.java:452)
{code}
Sounds like the same issue.

> Flakiness in TestAsyncKuduSession
> -
>
> Key: KUDU-1521
> URL: https://issues.apache.org/jira/browse/KUDU-1521
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.9.1
>Reporter: Adar Dembo
>Assignee: Todd Lipcon
>Priority: Major
> Attachments: 
> org.apache.kudu.client.TestAsyncKuduSession-TableIsDeleted-output.txt, 
> org.apache.kudu.client.TestAsyncKuduSession-output.txt, 
> org.apache.kudu.client.TestAsyncKuduSession.test.log.xz
>
>
>  I've been trying to parse the various failures in 
> http://104.196.14.100/job/kudu-gerrit/2270/BUILD_TYPE=RELEASE. Here's what I 
> see in the test:
> The way test() tests AUTO_FLUSH_BACKGROUND is inherently flaky; a delay while 
> running test code will give the background flush task a chance to fire when 
> the test code doesn't expect it. I've seen this lead to no 
> PleaseThrottleException, but I suspect the first block of test code dealing 
> with background flushes is flaky too (since it's testing elapsed time).
> There's also some test failures that I can't figure out. I've pasted them 
> below for posterity:
> {noformat}
> 03:52:14 
> testGetTableLocationsErrorCauseSessionStuck(org.kududb.client.TestAsyncKuduSession)
>   Time elapsed: 100.009 sec  <<< ERROR!
> 03:52:14 java.lang.Exception: test timed out after 10 milliseconds
> 03:52:14  at java.lang.Object.wait(Native Method)
> 03:52:14  at java.lang.Object.wait(Object.java:503)
> 03:52:14  at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1136)
> 03:52:14  at com.stumbleupon.async.Deferred.join(Deferred.java:1019)
> 03:52:14  at 
> org.kududb.client.TestAsyncKuduSession.testGetTableLocationsErrorCauseSessionStuck(TestAsyncKuduSession.java:133)
> 03:52:14 
> 03:52:14 
> testBatchErrorCauseSessionStuck(org.kududb.client.TestAsyncKuduSession)  Time 
> elapsed: 0.199 sec  <<< ERROR!
> 03:52:14 org.kududb.client.MasterErrorException: Server[Kudu Master - 
> 127.13.215.1:64030] NOT_FOUND[code 1]: The table was deleted: Table deleted 
> at 2016-07-09 03:50:24 UTC
> 03:52:14  at 
> org.kududb.client.TabletClient.dispatchMasterErrorOrReturnException(TabletClient.java:533)
> 03:52:14  at org.kududb.client.TabletClient.decode(TabletClient.java:463)
> 03:52:14  at org.kududb.client.TabletClient.decode(TabletClient.java:83)
> 03:52:14  at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500)
> 03:52:14  at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
> 03:52:14  at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
> 03:52:14  at 
> org.kududb.client.TabletClient.handleUpstream(TabletClient.java:638)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
> 03:52:14  at 
> org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
> 03:52:14  at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
> 03:52:14  at 
> org.kududb.client.AsyncKuduClient$TabletClientPipeline.sendUpstream(AsyncKuduClient.java:1877)
> 03:52:14  at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
> 03:52:14  at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
> 03:52:14  at 
> org.jboss.

[jira] [Commented] (KUDU-2219) org.apache.kudu.client.TestKuduClient.testCloseShortlyAfterOpen flaky

2018-09-25 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16627836#comment-16627836
 ] 

Mike Percy commented on KUDU-2219:
--

I traced this down and it turns out that this races with Master leader 
election. If the Master leader election is a little slow then 
KuduClient.exportAuthenticationCredentials() will throw a 
NoLeaderFoundException after trying once on each Master server. Instead, it 
should sleep and retry until it hits a timeout. That issue is filed as 
KUDU-2387.
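
Until that is fixed, a caller-side sketch of the sleep-and-retry behavior 
described above (the retry helper and deadline are illustrative, not a Kudu 
API):
{code:java}
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;

public class ExportCredsWithRetry {
  static byte[] exportWithRetry(KuduClient client, long deadlineMs)
      throws KuduException, InterruptedException {
    long start = System.currentTimeMillis();
    while (true) {
      try {
        return client.exportAuthenticationCredentials();
      } catch (KuduException e) {  // e.g. no leader master elected yet
        if (System.currentTimeMillis() - start > deadlineMs) {
          throw e;  // deadline exhausted; surface the last failure
        }
        Thread.sleep(250);  // back off briefly before retrying
      }
    }
  }
}
{code}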

> org.apache.kudu.client.TestKuduClient.testCloseShortlyAfterOpen flaky
> -
>
> Key: KUDU-2219
> URL: https://issues.apache.org/jira/browse/KUDU-2219
> Project: Kudu
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.6.0
>Reporter: Todd Lipcon
>Assignee: Andrew Wong
>Priority: Major
> Attachments: 
> org.apache.kudu.client.TestKuduClient.testCloseShortlyAfterOpen.log.xz
>
>
> This test has an assertion that no exceptions get logged, but it seems to 
> fail sometimes with an IllegalStateException in the log:
> {code}
> ERROR - [peer master-127.62.82.1:64034] unexpected exception from downstream 
> on [id: 0xc4472f9d, /127.62.82.1:58372 :> /127.62.82.1:64034]
> java.lang.IllegalStateException
>   at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:429)
>   at 
> org.apache.kudu.client.Connection.messageReceived(Connection.java:264)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>   at org.apache.kudu.client.Connection.handleUpstream(Connection.java:236)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:68)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:291)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>   at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
>   at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
>   at org.apache.kudu.client.Negotiator.finish(Negotiator.java:653)
>   at 
> org.apache.kudu.client.Negotiator.handleSuccessResponse(Negotiator.java:641)
>   at 
> org.apache.kudu.client.Negotiator.handleSaslMessage(Negotiator.java:278)
>   at org.apache.kudu.client.Negotiator.handleResponse(Negotiator.java:258)
>   at 
> org.apache.kudu.client.Negotiator.messageReceived(Negotiator.java:231)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
>   at 
> org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70)
>   at 
> org.jb

[jira] [Created] (KUDU-2584) Flaky testSimpleBackupAndRestore

2018-09-18 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2584:


 Summary: Flaky testSimpleBackupAndRestore
 Key: KUDU-2584
 URL: https://issues.apache.org/jira/browse/KUDU-2584
 Project: Kudu
  Issue Type: Bug
  Components: backup
Reporter: Mike Percy


testSimpleBackupAndRestore is flaky and tends to fail with the following error:
{code:java}
04:48:06.604 [ERROR - Test worker] (RetryRule.java:72) 
testRandomBackupAndRestore(org.apache.kudu.backup.TestKuduBackup): failed run 1 
java.lang.AssertionError: expected:<111> but was:<110> 
at org.junit.Assert.fail(Assert.java:88) 
at org.junit.Assert.failNotEquals(Assert.java:834) 
at org.junit.Assert.assertEquals(Assert.java:645) 
at org.junit.Assert.assertEquals(Assert.java:631) 
at 
org.apache.kudu.backup.TestKuduBackup.testRandomBackupAndRestore(TestKuduBackup.scala:99)
 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
at java.lang.reflect.Method.invoke(Method.java:483) 
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
 
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
 
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
 
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
 
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
at org.apache.kudu.junit.RetryRule$RetryStatement.evaluate(RetryRule.java:68) 
at org.junit.rules.RunRules.evaluate(RunRules.java:20) 
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) 
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
 
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
 
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) 
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) 
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) 
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) 
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) 
at org.junit.runners.ParentRunner.run(ParentRunner.java:363) 
at 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106)
 
at 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
 
at 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
 
at 
org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:66)
 
at 
org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
at java.lang.reflect.Method.invoke(Method.java:483) 
at 
org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
 
at 
org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
 
at 
org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32)
 
at 
org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93)
 
at com.sun.proxy.$Proxy2.processTestClass(Unknown Source) 
at 
org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:117)
 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
at java.lang.reflect.Method.invoke(Method.java:483) 
at 
org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
 
at 
org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
 
at 
org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:155)
 
at 
org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:137)
 
at 
org.gradle.internal.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:404)
 
at 
org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
 
at 
org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(

[jira] [Updated] (KUDU-2583) LeakSanitizer failure in kudu-admin-test

2018-09-18 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2583:
-
Issue Type: Bug  (was: Improvement)

> LeakSanitizer failure in kudu-admin-test
> 
>
> Key: KUDU-2583
> URL: https://issues.apache.org/jira/browse/KUDU-2583
> Project: Kudu
>  Issue Type: Bug
>Reporter: Mike Percy
>Priority: Major
>
> Saw this error in an automated test run from kudu-admin-test in 
> DDLDuringRebalancingTest.TablesCreatedAndDeletedDuringRebalancing/0:
> {code:java}
> ==27773==ERROR: LeakSanitizer: detected memory leaks 
> Direct leak of 50 byte(s) in 1 object(s) allocated from: 
> #0 0x531928 in operator new(unsigned long) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
>  
> #1 0x377b29c3c8 in std::string::_Rep::_S_create(unsigned long, unsigned long, 
> std::allocator const&) (/usr/lib64/libstdc++.so.6+0x377b29c3c8) 
> Direct leak of 40 byte(s) in 1 object(s) allocated from: 
> #0 0x531928 in operator new(unsigned long) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
>  
> #1 0x7fe3255f5ccf in 
> _ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC2IN4kudu15ClosureRunnableESaIS5_EJNS4_8CallbackIFvvEESt19_Sp_make_shared_tagPT_RKT0_DpOT1_
>  ../../../include/c++/4.9.2/bits/shared_ptr_base.h:616:25 
> #2 0x7fe3255f5b7a in 
> _ZNSt12__shared_ptrIN4kudu15ClosureRunnableELN9__gnu_cxx12_Lock_policyE2EEC2ISaIS1_EJNS0_8CallbackIFvvEESt19_Sp_make_shared_tagRKT_DpOT0_
>  ../../../include/c++/4.9.2/bits/shared_ptr_base.h:1089:14 
> #3 0x7fe3255f5a5f in 
> _ZSt15allocate_sharedIN4kudu15ClosureRunnableESaIS1_EJNS0_8CallbackIFvvESt10shared_ptrIT_ERKT0_DpOT1_
>  ../../../include/c++/4.9.2/bits/shared_ptr.h:587:14 
> #4 0x7fe3255ed9c0 in 
> _ZSt11make_sharedIN4kudu15ClosureRunnableEJNS0_8CallbackIFvvESt10shared_ptrIT_EDpOT0_
>  ../../../include/c++/4.9.2/bits/shared_ptr.h:603:14 
> #5 0x7fe3255ea383 in kudu::ThreadPool::SubmitClosure(kudu::Callback ()()>) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/util/threadpool.cc:450:17
>  
> #6 0x7fe32e4a42ff in kudu::log::Log::AppendThread::Wake() 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/log.cc:289:5
>  
> #7 0x7fe32e4af94f in 
> kudu::log::Log::AsyncAppend(std::unique_ptr std::default_delete >, kudu::Callback ()(kudu::Status const&)> const&) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/log.cc:602:19
>  
> #8 0x7fe32e4affbf in 
> kudu::log::Log::AsyncAppendReplicates(std::vector,
>  std::allocator > > 
> const&, kudu::Callback const&) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/log.cc:614:10
>  
> #9 0x7fe32eb67994 in 
> kudu::consensus::LogCache::AppendOperations(std::vector,
>  std::allocator > > 
> const&, kudu::Callback const&) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/log_cache.cc:213:29
>  
> #10 0x7fe32eb0b99e in 
> kudu::consensus::PeerMessageQueue::AppendOperations(std::vector,
>  std::allocator > > 
> const&, kudu::Callback const&) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/consensus_queue.cc:403:3
>  
> #11 0x7fe32ebc8df0 in 
> kudu::consensus::RaftConsensus::UpdateReplica(kudu::consensus::ConsensusRequestPB
>  const*, kudu::consensus::ConsensusResponsePB*) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/raft_consensus.cc:1451:7
>  
> #12 0x7fe32ebc52bf in 
> kudu::consensus::RaftConsensus::Update(kudu::consensus::ConsensusRequestPB 
> const*, kudu::consensus::ConsensusResponsePB*) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/raft_consensus.cc:914:14
>  
> #13 0x7fe331bbb369 in 
> kudu::tserver::ConsensusServiceImpl::UpdateConsensus(kudu::consensus::ConsensusRequestPB
>  const*, kudu::consensus::ConsensusResponsePB*, kudu::rpc::RpcContext*) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tserver/tablet_service.cc:946:25
>  
> #14 0x7fe3293f5cb9 in std::_Function_handler ()(google::protobuf::Message const*, google::protobuf::Message*, 
> kudu::rpc::RpcContext*), 
> kudu::consensus::ConsensusServiceIf::ConsensusServiceIf(scoped_refptr
>  const&, scoped_refptr 
> const&)::$_1>::_M_invoke(std::_Any_data const&, google::protobuf::Message 
> const*, google::protobuf::Message*, kudu::rpc::RpcContext*) 
> ../../../include/c++/4.9.2/functional:2039:2 
> #15 0x7fe32841e2fb in std::function google::protobuf::Message*, 
> kudu::rpc::RpcContext*)>::operator()(google::protobuf::Message const*, 
> google::protobuf::Message*, kudu::rpc::RpcContext*) const 
> ../../../include/c++/4.9.2/functional:2439:14 
> #16 0x7fe32841cd6a in 
> kudu::rpc::Gene

[jira] [Created] (KUDU-2583) LeakSanitizer failure in kudu-admin-test

2018-09-17 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2583:


 Summary: LeakSanitizer failure in kudu-admin-test
 Key: KUDU-2583
 URL: https://issues.apache.org/jira/browse/KUDU-2583
 Project: Kudu
  Issue Type: Improvement
Reporter: Mike Percy


Saw this error in an automated test run from kudu-admin-test in 
DDLDuringRebalancingTest.TablesCreatedAndDeletedDuringRebalancing/0:
{code:java}
==27773==ERROR: LeakSanitizer: detected memory leaks 

Direct leak of 50 byte(s) in 1 object(s) allocated from: 
#0 0x531928 in operator new(unsigned long) 
/data/somelongdirectorytoavoidrpathissues/src/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
 
#1 0x377b29c3c8 in std::string::_Rep::_S_create(unsigned long, unsigned long, 
std::allocator const&) (/usr/lib64/libstdc++.so.6+0x377b29c3c8) 

Direct leak of 40 byte(s) in 1 object(s) allocated from: 
#0 0x531928 in operator new(unsigned long) 
/data/somelongdirectorytoavoidrpathissues/src/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
 
#1 0x7fe3255f5ccf in 
_ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC2IN4kudu15ClosureRunnableESaIS5_EJNS4_8CallbackIFvvEESt19_Sp_make_shared_tagPT_RKT0_DpOT1_
 ../../../include/c++/4.9.2/bits/shared_ptr_base.h:616:25 
#2 0x7fe3255f5b7a in 
_ZNSt12__shared_ptrIN4kudu15ClosureRunnableELN9__gnu_cxx12_Lock_policyE2EEC2ISaIS1_EJNS0_8CallbackIFvvEESt19_Sp_make_shared_tagRKT_DpOT0_
 ../../../include/c++/4.9.2/bits/shared_ptr_base.h:1089:14 
#3 0x7fe3255f5a5f in 
_ZSt15allocate_sharedIN4kudu15ClosureRunnableESaIS1_EJNS0_8CallbackIFvvESt10shared_ptrIT_ERKT0_DpOT1_
 ../../../include/c++/4.9.2/bits/shared_ptr.h:587:14 
#4 0x7fe3255ed9c0 in 
_ZSt11make_sharedIN4kudu15ClosureRunnableEJNS0_8CallbackIFvvESt10shared_ptrIT_EDpOT0_
 ../../../include/c++/4.9.2/bits/shared_ptr.h:603:14 
#5 0x7fe3255ea383 in kudu::ThreadPool::SubmitClosure(kudu::Callback) 
/data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/util/threadpool.cc:450:17
 
#6 0x7fe32e4a42ff in kudu::log::Log::AppendThread::Wake() 
/data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/log.cc:289:5
 
#7 0x7fe32e4af94f in 
kudu::log::Log::AsyncAppend(std::unique_ptr >, kudu::Callback const&) 
/data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/log.cc:602:19
 
#8 0x7fe32e4affbf in 
kudu::log::Log::AsyncAppendReplicates(std::vector,
 std::allocator > > const&, 
kudu::Callback const&) 
/data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/log.cc:614:10
 
#9 0x7fe32eb67994 in 
kudu::consensus::LogCache::AppendOperations(std::vector,
 std::allocator > > const&, 
kudu::Callback const&) 
/data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/log_cache.cc:213:29
 
#10 0x7fe32eb0b99e in 
kudu::consensus::PeerMessageQueue::AppendOperations(std::vector,
 std::allocator > > const&, 
kudu::Callback const&) 
/data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/consensus_queue.cc:403:3
 
#11 0x7fe32ebc8df0 in 
kudu::consensus::RaftConsensus::UpdateReplica(kudu::consensus::ConsensusRequestPB
 const*, kudu::consensus::ConsensusResponsePB*) 
/data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/raft_consensus.cc:1451:7
 
#12 0x7fe32ebc52bf in 
kudu::consensus::RaftConsensus::Update(kudu::consensus::ConsensusRequestPB 
const*, kudu::consensus::ConsensusResponsePB*) 
/data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/consensus/raft_consensus.cc:914:14
 
#13 0x7fe331bbb369 in 
kudu::tserver::ConsensusServiceImpl::UpdateConsensus(kudu::consensus::ConsensusRequestPB
 const*, kudu::consensus::ConsensusResponsePB*, kudu::rpc::RpcContext*) 
/data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tserver/tablet_service.cc:946:25
 
#14 0x7fe3293f5cb9 in std::_Function_handler
 const&, scoped_refptr 
const&)::$_1>::_M_invoke(std::_Any_data const&, google::protobuf::Message 
const*, google::protobuf::Message*, kudu::rpc::RpcContext*) 
../../../include/c++/4.9.2/functional:2039:2 
#15 0x7fe32841e2fb in std::function::operator()(google::protobuf::Message const*, 
google::protobuf::Message*, kudu::rpc::RpcContext*) const 
../../../include/c++/4.9.2/functional:2439:14 
#16 0x7fe32841cd6a in 
kudu::rpc::GeneratedServiceIf::Handle(kudu::rpc::InboundCall*) 
/data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/rpc/service_if.cc:139:3
 
#17 0x7fe328420d87 in kudu::rpc::ServicePool::RunThread() 
/data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/rpc/service_pool.cc:225:15
 
#18 0x7fe328426612 in boost::_bi::bind_t, 
boost::_bi::list1 > >::operator()() 
/data/somelongdirectorytoavoidrpathissues/src/kudu/thirdparty/installed/uninstrumented/include/boost/bind/bind.hpp:1222:16
 
#19 0x7fe32837bf1b in boost::function0::operator()() const 
/data/somelongdirectorytoavoidrpathissues/src/kudu/thirdparty/installed/uninstrumented/include/bo

[jira] [Resolved] (KUDU-2559) kudu-tool-test TestLoadgenDatabaseName fails with a memory leak

2018-09-17 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy resolved KUDU-2559.
--
   Resolution: Cannot Reproduce
Fix Version/s: n/a

Resolving this as Cannot Reproduce because there is no log file attached; 
please reopen if you have the file!

> kudu-tool-test TestLoadgenDatabaseName fails with a memory leak
> ---
>
> Key: KUDU-2559
> URL: https://issues.apache.org/jira/browse/KUDU-2559
> Project: Kudu
>  Issue Type: Bug
>  Components: ksck
>Reporter: Andrew Wong
>Priority: Major
> Fix For: n/a
>
> Attachments: kudu-tool-test.2.xml
>
>
> I've attached a log with the LeakSanitizer error, though looking at the test 
> itself and the error, it isn't clear to me why the issue would be specific to 
> this test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2562) Checkpoint highest legal timestamp in tablet superblock when tablet history GC deletes data

2018-09-14 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2562:
-
Description: 
Checkpoint the highest legal timestamp in the tablet superblock when tablet 
history GC deletes data so that increasing the AHM age doesn’t expose us to 
inconsistent scans after a GC.

This is a real edge case and is a temporary condition depending on users 
restarting with a changed configuration flag. However without this safety 
feature, users can get bad scans if they increase the 
{{[--tablet_history_max_age_sec|http://kudu.apache.org/docs/configuration_reference.html#kudu-tserver_tablet_history_max_age_sec]}}
 command-line flag after a GC operation runs.

  was:
Checkpoint the highest legal timestamp in the tablet superblock when tablet 
history GC deletes data so that increasing the AHM age doesn’t expose us to 
inconsistent scans after a GC.

This is a real edge case and is a temporary condition depending on users 
restarting with a changed configuration flag. However without this safety 
feature, users can get bad scans if they change the flag.


> Checkpoint highest legal timestamp in tablet superblock when tablet history 
> GC deletes data
> ---
>
> Key: KUDU-2562
> URL: https://issues.apache.org/jira/browse/KUDU-2562
> Project: Kudu
>  Issue Type: Improvement
>  Components: tablet
>Affects Versions: 1.7.1
>Reporter: Mike Percy
>Priority: Minor
>
> Checkpoint the highest legal timestamp in the tablet superblock when tablet 
> history GC deletes data so that increasing the AHM age doesn’t expose us to 
> inconsistent scans after a GC.
> This is a real edge case and is a temporary condition depending on users 
> restarting with a changed configuration flag. However without this safety 
> feature, users can get bad scans if they increase the 
> {{[--tablet_history_max_age_sec|http://kudu.apache.org/docs/configuration_reference.html#kudu-tserver_tablet_history_max_age_sec]}}
>  command-line flag after a GC operation runs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2577) Support rebalancing data allocation across directories when adding a new data dir

2018-09-12 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2577:


 Summary: Support rebalancing data allocation across directories 
when adding a new data dir
 Key: KUDU-2577
 URL: https://issues.apache.org/jira/browse/KUDU-2577
 Project: Kudu
  Issue Type: Improvement
  Components: ops-tooling, tablet
Affects Versions: 1.7.0
Reporter: Mike Percy


I got a request for a tool to rebalance data usage across a single server's 
data directories when adding a data dir. There is no such tool, but I wanted to 
document that request because it's a reasonable feature to have.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-686) Delta apply optimizations

2018-09-12 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612466#comment-16612466
 ] 

Mike Percy commented on KUDU-686:
-

[~adar], would you mind elaborating more on the new approach, what it solves, 
and how it does it?

> Delta apply optimizations
> -
>
> Key: KUDU-686
> URL: https://issues.apache.org/jira/browse/KUDU-686
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, tablet
>Affects Versions: M4.5
>Reporter: David Alves
>Assignee: Adar Dembo
>Priority: Trivial
>
> We currently iterate on each delta file several times, one for deletes and 
> then one for each one of the columns.
> It seems that, when selecting all the columns, it would be more efficient to 
> apply the deltas to all columns at the same time. This might or might not be 
> advantageous depending on the number of columns projected. Todd also suggests 
> that whether this is an advantage also depends on whether there are 
> predicates being pushed down.
> We could likely also merge the updates and deletes into a single iteration or 
> at least avoid applying the mutations if the row will end up deleted (right 
> now we still apply the updates even when we find that the row will be 
> deleted).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2563) Spark integration should implement scanner keep-alive API

2018-09-11 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610906#comment-16610906
 ] 

Mike Percy commented on KUDU-2563:
--

It appears that we'll have to expose scanner keepalive to the Java API before 
implementing Spark support, because the only place I see the keepalive API 
defined is in the C++ client KuduScanner class @ 
http://kudu.apache.org/releases/1.7.1/cpp-client-api/classkudu_1_1client_1_1KuduScanner.html#aa4a0caf7142880255d7aac1d75f33d21
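
As a reference for what the Spark integration would do, here is a sketch of 
the keep-alive pattern, assuming a Java-side keepAlive() analogous to the C++ 
API linked above gets exposed. The 15-second interval and loop structure are 
illustrative:

{code:java}
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.RowResultIterator;

class KeepAliveScan {
  // Drain a scanner slowly while periodically pinging the server so the
  // server-side scanner does not time out between nextRows() calls.
  static void scanWithKeepAlive(KuduScanner scanner) throws KuduException {
    long lastKeepAliveMs = System.currentTimeMillis();
    while (scanner.hasMoreRows()) {
      RowResultIterator batch = scanner.nextRows();
      // ... slow downstream processing of 'batch' happens here ...
      if (System.currentTimeMillis() - lastKeepAliveMs > 15_000) {
        scanner.keepAlive(); // assumed Java analogue of the C++ KeepAlive()
        lastKeepAliveMs = System.currentTimeMillis();
      }
    }
  }
}
{code}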

> Spark integration should implement scanner keep-alive API
> -
>
> Key: KUDU-2563
> URL: https://issues.apache.org/jira/browse/KUDU-2563
> Project: Kudu
>  Issue Type: Improvement
>  Components: client, spark
>Affects Versions: 1.7.1
>Reporter: Mike Percy
>Assignee: Grant Henke
>Priority: Major
>
> The Spark integration should implement the scanner keep-alive API like the 
> Impala scanner does in order to avoid errors related to scanners timing out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2563) Spark integration should implement scanner keep-alive API

2018-08-30 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2563:


 Summary: Spark integration should implement scanner keep-alive API
 Key: KUDU-2563
 URL: https://issues.apache.org/jira/browse/KUDU-2563
 Project: Kudu
  Issue Type: Improvement
  Components: client, spark
Affects Versions: 1.7.1
Reporter: Mike Percy


The Spark integration should implement the scanner keep-alive API like the 
Impala scanner does in order to avoid errors related to scanners timing out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2562) Checkpoint highest legal timestamp in tablet superblock when tablet history GC deletes data

2018-08-30 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2562:


 Summary: Checkpoint highest legal timestamp in tablet superblock 
when tablet history GC deletes data
 Key: KUDU-2562
 URL: https://issues.apache.org/jira/browse/KUDU-2562
 Project: Kudu
  Issue Type: Improvement
  Components: tablet
Affects Versions: 1.7.1
Reporter: Mike Percy


Checkpoint the highest legal timestamp in the tablet superblock when tablet 
history GC deletes data so that increasing the AHM age doesn’t expose us to 
inconsistent scans after a GC.

This is a real edge case and is a temporary condition depending on users 
restarting with a changed configuration flag. However without this safety 
feature, users can get bad scans if they change the flag.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2516) Add NOT EQUAL predicate type

2018-07-26 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2516:


 Summary: Add NOT EQUAL predicate type
 Key: KUDU-2516
 URL: https://issues.apache.org/jira/browse/KUDU-2516
 Project: Kudu
  Issue Type: Sub-task
  Components: cfile, perf
Affects Versions: 1.7.1
Reporter: Mike Percy


Kudu currently does not have support for a NOT_EQUAL predicate type. This is 
usually relevant when AND-ed together with other predicates.
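
For context, predicates within a single Kudu scan are ANDed together, so until 
a NOT_EQUAL predicate type exists, a "col != value" filter takes two scans 
whose results the caller concatenates. A rough sketch of that workaround (the 
helper name and long-typed column are illustrative assumptions):

{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduPredicate;
import org.apache.kudu.client.KuduPredicate.ComparisonOp;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;

class NotEqualWorkaround {
  // Emulate "col != value" today: the complement of equality needs two
  // range scans (col < value and col > value).
  static List<KuduScanner> notEqualScanners(
      KuduClient client, KuduTable table, ColumnSchema col, long value) {
    List<KuduScanner> scanners = new ArrayList<>();
    scanners.add(client.newScannerBuilder(table)
        .addPredicate(
            KuduPredicate.newComparisonPredicate(col, ComparisonOp.LESS, value))
        .build());
    scanners.add(client.newScannerBuilder(table)
        .addPredicate(
            KuduPredicate.newComparisonPredicate(col, ComparisonOp.GREATER, value))
        .build());
    return scanners;
  }
}
{code}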



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2515) Implement Spark join optimization support

2018-07-26 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2515:


 Summary: Implement Spark join optimization support
 Key: KUDU-2515
 URL: https://issues.apache.org/jira/browse/KUDU-2515
 Project: Kudu
  Issue Type: Improvement
Affects Versions: 1.7.1
Reporter: Mike Percy


At the time of writing, Spark is not able to properly optimize joins on Kudu 
tables because Kudu does not provide statistics for Spark to use to determine 
the optimal join strategy.

It would be a big improvement to find some way to help Spark optimize joins 
between Kudu tables or between Kudu tables and Parquet-on-HDFS tables. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2513) Fix Flume sink class names on Kudu Flume Sink blog post

2018-07-25 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2513:
-
Component/s: documentation

> Fix Flume sink class names on Kudu Flume Sink blog post
> ---
>
> Key: KUDU-2513
> URL: https://issues.apache.org/jira/browse/KUDU-2513
> Project: Kudu
>  Issue Type: Improvement
>  Components: documentation, flume-sink
>Affects Versions: 1.7.1
>Reporter: Mike Percy
>Priority: Major
>  Labels: blog, newbie
>
> The blog post for the Kudu Flume sink is the easiest documentation for using 
> it but the class names have changed since it was posted and it's out of date. 
> We should fix the examples.
> https://kudu.apache.org/2016/08/31/intro-flume-kudu-sink.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2513) Fix Flume sink class names on Kudu Flume Sink blog post

2018-07-25 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2513:
-
Labels: blog newbie  (was: )

> Fix Flume sink class names on Kudu Flume Sink blog post
> ---
>
> Key: KUDU-2513
> URL: https://issues.apache.org/jira/browse/KUDU-2513
> Project: Kudu
>  Issue Type: Improvement
>  Components: flume-sink
>Affects Versions: 1.7.1
>Reporter: Mike Percy
>Priority: Major
>  Labels: blog, newbie
>
> The blog post for the Kudu Flume sink is the easiest documentation for using 
> it but the class names have changed since it was posted and it's out of date. 
> We should fix the examples.
> https://kudu.apache.org/2016/08/31/intro-flume-kudu-sink.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2513) Fix Flume sink class names on Kudu Flume Sink blog post

2018-07-25 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2513:


 Summary: Fix Flume sink class names on Kudu Flume Sink blog post
 Key: KUDU-2513
 URL: https://issues.apache.org/jira/browse/KUDU-2513
 Project: Kudu
  Issue Type: Improvement
  Components: flume-sink
Affects Versions: 1.7.1
Reporter: Mike Percy


The blog post for the Kudu Flume sink is the easiest documentation for using it 
but the class names have changed since it was posted and it's out of date. We 
should fix the examples.

https://kudu.apache.org/2016/08/31/intro-flume-kudu-sink.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2411) Create a public test utility artifact

2018-07-24 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554510#comment-16554510
 ] 

Mike Percy commented on KUDU-2411:
--

Below is a link to a bare-bones binary test artifact that I built on CentOS 6 
and was able to run its binaries with --help on Ubuntu 16.04. It's not a 
release version (1.8.0-SNAPSHOT); it contains snapshots of all security libs, 
should never be used "in production", and, as I said, is not really tested 
yet. I think it 
probably works, though.

https://drive.google.com/file/d/187tpUZJP-SiMsMVbj-9FcATUbQXmuUiy/

> Create a public test utility artifact
> -
>
> Key: KUDU-2411
> URL: https://issues.apache.org/jira/browse/KUDU-2411
> Project: Kudu
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.0
>Reporter: Grant Henke
>Assignee: Grant Henke
>Priority: Major
>  Labels: community
>
> Create a public published test utility jar that contains useful testing 
> utilities for applications that integrate with Kudu including things like 
> BaseKuduTest.java and MiniKuduCluster.java.
> This has the added benefit of eliminating the unusual dependency on all of 
> kudu-clients test in each of the other java modules. This could likely be 
> used in our examples code too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2411) Create a public test utility artifact

2018-07-24 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554496#comment-16554496
 ] 

Mike Percy commented on KUDU-2411:
--

Hi [~timrobertson100], I can upload an initial Linux version of the binary 
tarball somewhere for you to try out. Hopefully we will eventually get the 
relevant scripts merged into the Kudu main line, so collaborating via Gerrit 
would be ideal because other Kudu devs will see the patches. That said, if you 
want to start off with a GH repo to minimize bootstrapping overhead before 
pushing patches to Gerrit, I'm open to that as well.

> Create a public test utility artifact
> -
>
> Key: KUDU-2411
> URL: https://issues.apache.org/jira/browse/KUDU-2411
> Project: Kudu
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.0
>Reporter: Grant Henke
>Assignee: Grant Henke
>Priority: Major
>  Labels: community
>
> Create a public published test utility jar that contains useful testing 
> utilities for applications that integrate with Kudu including things like 
> BaseKuduTest.java and MiniKuduCluster.java.
> This has the added benefit of eliminating the unusual dependency on all of 
> kudu-clients test in each of the other java modules. This could likely be 
> used in our examples code too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2506) Improve and document docs push procedure

2018-07-16 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2506:


 Summary: Improve and document docs push procedure
 Key: KUDU-2506
 URL: https://issues.apache.org/jira/browse/KUDU-2506
 Project: Kudu
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.7.1
Reporter: Mike Percy


As of this writing, when we want to push docs for a release, the release docs 
will overwrite the existing "unversioned" docs. This is a problem for 
maintenance releases, such as releasing a 1.5.1 after a 1.6.0 release, since 
the unversioned 1.6.0 docs living at [http://kudu.apache.org/docs/] will be 
replaced with 1.5.1 docs.

We should improve this process and the scripts that drive it. Potential 
improvements:
 * Add an option to the docs publish script to only update the versioned docs, 
i.e. --versioned-only
 * Separate out master vs versioned docs push into separate script invocations
 * Create a Jenkins job that can build and deploy docs to either /docs or a 
release docs location.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2505) Add menu for switching between docs versions to Kudu web site docs

2018-07-16 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2505:


 Summary: Add menu for switching between docs versions to Kudu web 
site docs
 Key: KUDU-2505
 URL: https://issues.apache.org/jira/browse/KUDU-2505
 Project: Kudu
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.7.1
Reporter: Mike Percy


It would be useful to have a "version switcher" widget on the Kudu 
documentation page that allowed people to navigate to another version of the 
docs from wherever they are, in case they land on the wrong version from a 
Google search.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2504) Add Kudu version number to header of docs pages

2018-07-16 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2504:


 Summary: Add Kudu version number to header of docs pages
 Key: KUDU-2504
 URL: https://issues.apache.org/jira/browse/KUDU-2504
 Project: Kudu
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.7.1
Reporter: Mike Percy


It is currently not easy to tell which version of the docs you are looking at 
when you are on the "unversioned" section of the Kudu docs @ 
[http://kudu.apache.org/docs/] – we should add a header or a little strip to 
the top of each page that says something like "you are looking at version 1.7.1 
of the docs" or "you are looking at docs for version 1.8.0-SNAPSHOT generated 
from Git commit eee82d90a54108f2d7e18e84ec0bbd391fcc129a"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2411) Create a public test utility artifact

2018-07-12 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542306#comment-16542306
 ] 

Mike Percy commented on KUDU-2411:
--

I threw together a proof-of-concept of this today by building on EL6, copying 
all the deps (except for a few system libs like libpthread, libc, libdl, 
libgcc, etc.), changing the rpath to point to the deps, and running the result 
on Ubuntu 16.04. I was able to run the binaries to the point that --help did 
not crash. I'm going to work on making generation of such an artifact a bit 
less hacky and try a more interesting test when I find a few spare minutes to 
work on it.

> Create a public test utility artifact
> -
>
> Key: KUDU-2411
> URL: https://issues.apache.org/jira/browse/KUDU-2411
> Project: Kudu
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.0
>Reporter: Grant Henke
>Assignee: Grant Henke
>Priority: Major
>  Labels: community
>
> Create a public published test utility jar that contains useful testing 
> utilities for applications that integrate with Kudu including things like 
> BaseKuduTest.java and MiniKuduCluster.java.
> This has the added benefit of eliminating the unusual dependency on all of 
> kudu-clients test in each of the other java modules. This could likely be 
> used in our examples code too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2411) Create a public test utility artifact

2018-07-12 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2411:
-
Labels: community  (was: )

> Create a public test utility artifact
> -
>
> Key: KUDU-2411
> URL: https://issues.apache.org/jira/browse/KUDU-2411
> Project: Kudu
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.0
>Reporter: Grant Henke
>Assignee: Grant Henke
>Priority: Major
>  Labels: community
>
> Create a public published test utility jar that contains useful testing 
> utilities for applications that integrate with Kudu including things like 
> BaseKuduTest.java and MiniKuduCluster.java.
> This has the added benefit of eliminating the unusual dependency on all of 
> kudu-clients test in each of the other java modules. This could likely be 
> used in our examples code too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2411) Create a public test utility artifact

2018-07-12 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2411:
-
Component/s: (was: community)

> Create a public test utility artifact
> -
>
> Key: KUDU-2411
> URL: https://issues.apache.org/jira/browse/KUDU-2411
> Project: Kudu
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.0
>Reporter: Grant Henke
>Assignee: Grant Henke
>Priority: Major
>
> Create a public published test utility jar that contains useful testing 
> utilities for applications that integrate with Kudu including things like 
> BaseKuduTest.java and MiniKuduCluster.java.
> This has the added benefit of eliminating the unusual dependency on all of 
> kudu-clients test in each of the other java modules. This could likely be 
> used in our examples code too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2411) Create a public test utility artifact

2018-07-12 Thread Mike Percy (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2411:
-
Component/s: community

> Create a public test utility artifact
> -
>
> Key: KUDU-2411
> URL: https://issues.apache.org/jira/browse/KUDU-2411
> Project: Kudu
>  Issue Type: Improvement
>  Components: community, java
>Affects Versions: 1.7.0
>Reporter: Grant Henke
>Assignee: Grant Henke
>Priority: Major
>
> Create a public published test utility jar that contains useful testing 
> utilities for applications that integrate with Kudu including things like 
> BaseKuduTest.java and MiniKuduCluster.java.
> This has the added benefit of eliminating the unusual dependency on all of 
> kudu-clients test in each of the other java modules. This could likely be 
> used in our examples code too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2486) Leader should back off heartbeating to failed followers

2018-06-25 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2486:


 Summary: Leader should back off heartbeating to failed followers
 Key: KUDU-2486
 URL: https://issues.apache.org/jira/browse/KUDU-2486
 Project: Kudu
  Issue Type: Improvement
  Components: consensus
Affects Versions: 1.7.1
Reporter: Mike Percy


At the time of writing, the replica leader -> follower heartbeat mechanism has 
no backoff built in. Rather, it simply sends a heartbeat every configured 
period (say, 500ms). If a server is offline, this can cause log spam until 
that replica is evicted, and if a server is overloaded, the lack of backoff 
contributes to the problem.

Since we now have pre-election support, having leaders slow down their 
heartbeat attempts when follower requests are returning errors should not cause 
unnecessary leader elections, so backing off is feasible.
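
As an illustration, here is a minimal sketch of the kind of per-follower 
backoff this could use. The constants and names are assumptions, and the real 
change would live in the C++ consensus layer rather than Java:

{code:java}
class HeartbeatBackoff {
  private static final long BASE_PERIOD_MS = 500;    // normal heartbeat period
  private static final long MAX_BACKOFF_MS = 10_000; // cap on the backoff
  private int consecutiveFailures = 0;

  // A successful heartbeat returns the peer to the normal period.
  synchronized void recordSuccess() { consecutiveFailures = 0; }

  // Each failed heartbeat doubles the wait, up to the cap.
  synchronized void recordFailure() { consecutiveFailures++; }

  // Delay before the next heartbeat to this follower.
  synchronized long nextDelayMs() {
    if (consecutiveFailures == 0) {
      return BASE_PERIOD_MS;
    }
    long backoff = BASE_PERIOD_MS << Math.min(consecutiveFailures, 10);
    return Math.min(backoff, MAX_BACKOFF_MS);
  }
}
{code}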



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2484) ksck should show hostname in addition to "TS unavailable" when a host is down

2018-06-21 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2484:


 Summary: ksck should show hostname in addition to "TS unavailable" 
when a host is down
 Key: KUDU-2484
 URL: https://issues.apache.org/jira/browse/KUDU-2484
 Project: Kudu
  Issue Type: Improvement
  Components: ops-tooling
Affects Versions: 1.6.0
Reporter: Mike Percy


ksck should show hostname in addition to "TS unavailable" in the consensus 
matrix when a host is down so it's easier to troubleshoot consensus errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2438) Class relocation in the maven build should be the same as in the gradle build

2018-06-20 Thread Mike Percy (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16518339#comment-16518339
 ] 

Mike Percy commented on KUDU-2438:
--

This is good stuff; I think we should bring it forward at some point, but 
there's no need for you to push it through. Thanks for the WIP, Ferenc!

> Class relocation in the maven build should be the same as in the gradle build
> -
>
> Key: KUDU-2438
> URL: https://issues.apache.org/jira/browse/KUDU-2438
> Project: Kudu
>  Issue Type: Bug
>Reporter: Ferenc Szabo
>Assignee: Ferenc Szabo
>Priority: Major
>
> The shaded jars from Maven are referencing the original classes from Guava, 
> for example.
> The maven-shade-plugin should be configured to relocate them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

