[jira] [Commented] (KUDU-1973) Coalesce RPCs destined for the same server

2017-04-12 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967161#comment-15967161
 ] 

Todd Lipcon commented on KUDU-1973:
---

I think this should be more specific to the consensus system, not to krpc in 
general.

The thing here is that all of the tablets have separate heartbeat timers, so 
each heartbeat (empty UpdateConsensus) RPC ends up causing a wakeup and context 
switch on the server side, etc. But there isn't really any particular reason 
not to coalesce them so they all get sent and arrive at the same time -- so 
long as a heartbeat arrives before the election timer expires, it's fine to 
shift it later by hundreds of milliseconds, for example. As such, we could 
bundle a bunch of empty UpdateConsensus RPCs destined for the same node into a 
single RPC and avoid the extra wakeups.
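
To make the batching idea concrete, here's a minimal standalone sketch (my own 
illustration, not Kudu code; HeartbeatBatcher and the peer/tablet names are 
made up): instead of one timer per tablet, collect the tablet IDs that owe an 
empty heartbeat to a given peer and flush them as one combined message, keeping 
the flush interval well below the Raft election timeout.

{code}
// Not Kudu code -- a self-contained sketch of per-destination heartbeat batching.
#include <iostream>
#include <map>
#include <mutex>
#include <string>
#include <vector>

class HeartbeatBatcher {  // hypothetical name, not an actual Kudu class
 public:
  // Record that 'tablet_id' owes an empty heartbeat to 'peer'.
  void MarkDirty(const std::string& peer, const std::string& tablet_id) {
    std::lock_guard<std::mutex> l(mu_);
    pending_[peer].push_back(tablet_id);
  }

  // Flush one batched "RPC" per peer. In a real implementation this would be
  // a single UpdateConsensus-style call carrying all of the tablet IDs.
  void FlushAll() {
    std::map<std::string, std::vector<std::string>> to_send;
    {
      std::lock_guard<std::mutex> l(mu_);
      to_send.swap(pending_);
    }
    for (const auto& entry : to_send) {
      std::cout << "one batched heartbeat to " << entry.first << " covering "
                << entry.second.size() << " tablets" << std::endl;
    }
  }

 private:
  std::mutex mu_;
  std::map<std::string, std::vector<std::string>> pending_;
};

int main() {
  HeartbeatBatcher batcher;
  // Simulate 1000 tablets whose followers all live on the same remote node.
  for (int i = 0; i < 1000; i++) {
    batcher.MarkDirty("tserver-2:7050", "tablet-" + std::to_string(i));
  }
  batcher.FlushAll();  // one wakeup on the remote side instead of 1000
  return 0;
}
{code}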

> Coalesce RPCs destined for the same server
> --
>
> Key: KUDU-1973
> URL: https://issues.apache.org/jira/browse/KUDU-1973
> Project: Kudu
>  Issue Type: Sub-task
>  Components: rpc, tserver
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>  Labels: data-scalability
>
> The krpc subsystem ensures that only one _connection_ exists between any pair 
> of nodes, but it doesn't coalesce the _RPCs_ themselves. In clusters with 
> dense nodes (especially with a lot of tablets), there's often a great number 
> of RPCs sent between pairs of nodes.
> We should explore ways of coalescing those RPCs. I don't know whether that 
> would happen within the krpc system itself (i.e. in a payload-agnostic way), 
> or whether we'd only coalesce RPCs known to be "hot" (like UpdateConsensus).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (KUDU-1969) Please tidy up incubator distribution files

2017-04-12 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-1969.
---
   Resolution: Fixed
 Assignee: Todd Lipcon
Fix Version/s: n/a

Done, sorry for the delay.

> Please tidy up incubator distribution files
> ---
>
> Key: KUDU-1969
> URL: https://issues.apache.org/jira/browse/KUDU-1969
> Project: Kudu
>  Issue Type: Bug
> Environment: http://www.apache.org/dist/incubator/kudu/
>Reporter: Sebb
>Assignee: Todd Lipcon
> Fix For: n/a
>
>
> Please remove the old incubator releases as per:
> http://incubator.apache.org/guides/graduation.html#dist



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-1972) Explore ways to reduce maintenance manager CPU load

2017-04-12 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-1972:
--
Description: 
The current design of the maintenance manager includes a dedicated thread that 
wakes up every so often (defaults to 250 ms), looks for work to do, and 
schedules it to be done on helper threads. On a large tablet server, "look for 
work to do" can be very CPU intensive. We should explore ways to mitigate this.

Additionally, if we identify "cold" tablets (i.e. those not servicing any 
writes), we should be able to further reduce their scheduling load, perhaps by 
not running the compaction knapsack solver on them at all.

  was:
The current design of the maintenance manager includes a dedicated thread that 
wakes up every so often (defaults to 250 ms), looks for work to do, and 
schedules it to be done on helper threads. On a large cluster, "look for work 
to do" can be very CPU intensive. We should explore ways to mitigate this.

Additionally, if we identify "cold" tablets (i.e. those not servicing any 
writes), we should be able to further reduce their scheduling load, perhaps by 
not running the compaction knapsack solver on them at all.


> Explore ways to reduce maintenance manager CPU load
> ---
>
> Key: KUDU-1972
> URL: https://issues.apache.org/jira/browse/KUDU-1972
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tserver
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>  Labels: data-scalability
>
> The current design of the maintenance manager includes a dedicated thread 
> that wakes up every so often (defaults to 250 ms), looks for work to do, and 
> schedules it to be done on helper threads. On a large tablet server, "look 
> for work to do" can be very CPU intensive. We should explore ways to mitigate 
> this.
> Additionally, if we identify "cold" tablets (i.e. those not servicing any 
> writes), we should be able to further reduce their scheduling load, perhaps 
> by not running the compaction knapsack solver on them at all.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (KUDU-1974) Improve web UI experience with many tablets

2017-04-12 Thread Adar Dembo (JIRA)
Adar Dembo created KUDU-1974:


 Summary: Improve web UI experience with many tablets
 Key: KUDU-1974
 URL: https://issues.apache.org/jira/browse/KUDU-1974
 Project: Kudu
  Issue Type: Sub-task
  Components: supportability, tserver
Affects Versions: 1.4.0
Reporter: Adar Dembo


On nodes with many tablets, the web UI is...not great. There are several pages 
that display something for each tablet, and those pages become unwieldy with 
thousands of tablets. We should look into either:
# Removing those pages (if they aren't adding value), or
# Collapsing the data and making it expandable/searchable, or
# Exposing the same data in a different way (i.e. perhaps it'd be more relevant 
if aggregated anyway).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (KUDU-1973) Coalesce RPCs destined for the same server

2017-04-12 Thread Adar Dembo (JIRA)
Adar Dembo created KUDU-1973:


 Summary: Coalesce RPCs destined for the same server
 Key: KUDU-1973
 URL: https://issues.apache.org/jira/browse/KUDU-1973
 Project: Kudu
  Issue Type: Sub-task
  Components: rpc, tserver
Affects Versions: 1.4.0
Reporter: Adar Dembo


The krpc subsystem ensures that only one _connection_ exists between any pair 
of nodes, but it doesn't coalesce the _RPCs_ themselves. In clusters with dense 
nodes (especially with a lot of tablets), there's often a great number of RPCs 
sent between pairs of nodes.

We should explore ways of coalescing those RPCs. I don't know whether that 
would happen within the krpc system itself (i.e. in a payload-agnostic way), or 
whether we'd only coalesce RPCs known to be "hot" (like UpdateConsensus).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (KUDU-1972) Explore ways to reduce maintenance manager CPU load

2017-04-12 Thread Adar Dembo (JIRA)
Adar Dembo created KUDU-1972:


 Summary: Explore ways to reduce maintenance manager CPU load
 Key: KUDU-1972
 URL: https://issues.apache.org/jira/browse/KUDU-1972
 Project: Kudu
  Issue Type: Sub-task
  Components: tserver
Affects Versions: 1.4.0
Reporter: Adar Dembo


The current design of the maintenance manager includes a dedicated thread that 
wakes up every so often (defaults to 250 ms), looks for work to do, and 
schedules it to be done on helper threads. On a large cluster, "look for work 
to do" can be very CPU intensive. We should explore ways to mitigate this.

Additionally, if we identify "cold" tablets (i.e. those not servicing any 
writes), we should be able to further reduce their scheduling load, perhaps by 
not running the compaction knapsack solver on them at all.
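
For illustration only, here is a toy sketch of the "skip cold tablets" idea 
under assumed bookkeeping (this is not how the maintenance manager is actually 
structured; the write counters and tablet names are invented):

{code}
// A single scheduling pass that skips "cold" tablets -- those with no new
// writes since the last pass -- so the expensive "look for work to do" step
// only runs against tablets that could actually need maintenance.
#include <iostream>
#include <string>
#include <unordered_map>

struct TabletActivity {
  long writes_total;  // writes applied to the tablet (updated elsewhere)
  long writes_seen;   // writes already observed by the scheduler
};

int main() {
  std::unordered_map<std::string, TabletActivity> tablets = {
      {"t1", {10, 0}}, {"t2", {0, 0}}, {"t3", {3, 0}}};

  // One pass; a real scheduler would repeat this every ~250 ms.
  for (auto& kv : tablets) {
    TabletActivity& a = kv.second;
    if (a.writes_total == a.writes_seen) {
      continue;  // cold tablet: skip work scoring / knapsack entirely
    }
    a.writes_seen = a.writes_total;
    std::cout << "scoring maintenance ops for " << kv.first << std::endl;
  }
  return 0;
}
{code}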



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-1971) Explore reducing number of data blocks by tuning existing parameters

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo updated KUDU-1971:
-
Labels: data-scalability  (was: )

> Explore reducing number of data blocks by tuning existing parameters
> 
>
> Key: KUDU-1971
> URL: https://issues.apache.org/jira/browse/KUDU-1971
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tablet
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>  Labels: data-scalability
>
> One way to scale to larger on-disk data sets is to reduce the ratio between 
> data blocks and data; that is, to make data blocks larger. Two existing 
> parameters control this:
> * budgeted_compaction_target_rowset_size: within a given flush or compaction 
> operation, stipulates the size of each rowset. Currently 32M.
> * tablet_compaction_budget_mb: stipulates the amount of data that should be 
> included in any given compaction. Currently 128M.
> It might be interesting to explore raising these.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KUDU-1913) Tablet server runs out of threads when creating lots of tablets

2017-04-12 Thread Adar Dembo (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967031#comment-15967031
 ] 

Adar Dembo commented on KUDU-1913:
--

A couple notes about this:

We currently rely on there being a single prepare thread per tablet in order to 
serialize writes via Raft replication. If these threads were aggregated across 
the tserver, we'd want a way to ensure that writes from the same tablet are 
processed serially. Chromium's [Sequenced Worker 
Pool|https://cs.chromium.org/chromium/src/base/threading/sequenced_worker_pool.h?q=base::SequencedWorkerPool&sq=package:chromium&l=72&type=cs]
 might be a good fit for this.

[MultiRaft|https://www.cockroachlabs.com/blog/scaling-raft/] is an approach 
adopted by CockroachDB to improve Raft scalability when a server has many 
tablets. It could be worth exploring for our purposes too, though I see 
CockroachDB is [now using etcd's Raft 
implementation|https://github.com/cockroachdb/cockroach/issues/20]; I don't 
know if it implements MultiRaft or not.
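
As a rough illustration of the "sequenced execution over a shared pool" idea, 
here is a toy sketch (it is neither Chromium's SequencedWorkerPool API nor Kudu 
code; SequencedPool is a made-up helper): tasks submitted under the same key 
(e.g. the same tablet) run strictly in order, while different keys may run in 
parallel.

{code}
#include <deque>
#include <functional>
#include <iostream>
#include <map>
#include <mutex>
#include <set>
#include <string>
#include <thread>
#include <vector>

class SequencedPool {
 public:
  void Submit(const std::string& key, std::function<void()> task) {
    std::lock_guard<std::mutex> l(mu_);
    queues_[key].push_back(std::move(task));
    if (active_.insert(key).second) {
      // No worker is draining this key yet; start one. A production version
      // would hand the key off to a fixed-size shared thread pool instead.
      workers_.emplace_back(&SequencedPool::Drain, this, key);
    }
  }

  void Wait() {
    for (auto& t : workers_) t.join();
  }

 private:
  void Drain(std::string key) {
    while (true) {
      std::function<void()> task;
      {
        std::lock_guard<std::mutex> l(mu_);
        auto& q = queues_[key];
        if (q.empty()) {
          active_.erase(key);  // nothing left; allow future re-activation
          return;
        }
        task = std::move(q.front());
        q.pop_front();
      }
      task();  // run outside the lock; same-key tasks stay serialized
    }
  }

  std::mutex mu_;
  std::map<std::string, std::deque<std::function<void()>>> queues_;
  std::set<std::string> active_;
  std::vector<std::thread> workers_;
};

int main() {
  SequencedPool pool;
  for (int i = 0; i < 3; i++) {
    pool.Submit("tablet-A", [i] { std::cout << "A op " << i << "\n"; });
    pool.Submit("tablet-B", [i] { std::cout << "B op " << i << "\n"; });
  }
  pool.Wait();
  return 0;
}
{code}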

> Tablet server runs out of threads when creating lots of tablets
> ---
>
> Key: KUDU-1913
> URL: https://issues.apache.org/jira/browse/KUDU-1913
> Project: Kudu
>  Issue Type: Sub-task
>  Components: consensus, log
>Reporter: Juan Yu
>  Labels: data-scalability
>
> When adding lots of range partitions, all tablet servers crashed with the 
> following error:
> F0308 14:51:04.109369 12952 raft_consensus.cc:1985] Check failed: _s.ok() Bad 
> status: Runtime error: Could not create thread: Resource temporarily 
> unavailable (error 11)
> The tablet server should handle errors/failures more gracefully instead of crashing.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-383) Massive numbers of threads used by log append and GC

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo updated KUDU-383:

Issue Type: Bug  (was: Sub-task)
Parent: (was: KUDU-1967)

> Massive numbers of threads used by log append and GC
> 
>
> Key: KUDU-383
> URL: https://issues.apache.org/jira/browse/KUDU-383
> Project: Kudu
>  Issue Type: Bug
>  Components: log
>Affects Versions: M4.5
>Reporter: Mike Percy
>  Labels: data-scalability
> Fix For: n/a
>
> Attachments: create-table-stress-test-10697.txt.gz
>
>
> {noformat}
> $ ../../build-support/stacktrace-thread-summary.pl 
> create-table-stress-test-10697.txt | awk '{print $3}' | sort | uniq -c | sort 
> -n
>   1 kudu::KernelStackWatchdog::RunThread()
>   1 kudu::MaintenanceManager::RunSchedulerThread()
>   1 kudu::master::CatalogManagerBgTasks::Run()
>   1 kudu::tserver::Heartbeater::Thread::RunThread()
>   1 kudu::tserver::ScannerManager::RunRemovalThread()
>   1 main
>   1 timer_helper_thread
>   1 timer_sigev_thread
>   2 kudu::rpc::AcceptorPool::RunThread()
>   2 master_thread
>   4 kudu::ThreadPool::DispatchThread(bool)
>  12 kudu::rpc::ReactorThread::RunThread()
>  20 kudu::rpc::ServicePool::RunThread()
>3291 kudu::log::Log::AppendThread::RunThread()
>3291 kudu::tablet::TabletPeer::RunLogGC()
> W0626 02:09:16.853266 27997 messenger.cc:187] Unable to handle RPC call: 
> Service unavailable: TSHeartbeat request on kudu.master.MasterService from 
> 127.0.0.1:46937 dropped due to backpressure. The service queue is full; it 
> has 50 items.
> W0626 02:09:16.854862 28074 heartbeater.cc:278] Failed to heartbeat: Remote 
> error: Failed to send heartbeat: Service unavailable: TSHeartbeat request on 
> kudu.master.MasterService from 127.0.0.1:46937 dropped due to backpressure. 
> The service queue is full; it has 50 items.
> W0626 02:09:16.882686 27997 messenger.cc:187] Unable to handle RPC call: 
> Service unavailable: TSHeartbeat request on kudu.master.MasterService from 
> 127.0.0.1:46937 dropped due to backpressure. The service queue is full; it 
> has 50 items.
> W0626 02:09:16.884294 28074 heartbeater.cc:278] Failed to heartbeat: Remote 
> error: Failed to send heartbeat: Service unavailable: TSHeartbeat request on 
> kudu.master.MasterService from 127.0.0.1:46937 dropped due to backpressure. 
> The service queue is full; it has 50 items.
> F0626 02:09:31.747577 10965 test_main.cc:63] Maximum unit test time exceeded 
> (900 sec)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (KUDU-383) Massive numbers of threads used by log append and GC

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo resolved KUDU-383.
-
   Resolution: Duplicate
Fix Version/s: n/a

Yes, I'm closing a bug that's far older than its duplicate, but KUDU-1913 does 
a better job of describing the problem.

> Massive numbers of threads used by log append and GC
> 
>
> Key: KUDU-383
> URL: https://issues.apache.org/jira/browse/KUDU-383
> Project: Kudu
>  Issue Type: Sub-task
>  Components: log
>Affects Versions: M4.5
>Reporter: Mike Percy
>  Labels: data-scalability
> Fix For: n/a
>
> Attachments: create-table-stress-test-10697.txt.gz
>
>
> {noformat}
> $ ../../build-support/stacktrace-thread-summary.pl 
> create-table-stress-test-10697.txt | awk '{print $3}' | sort | uniq -c | sort 
> -n
>   1 kudu::KernelStackWatchdog::RunThread()
>   1 kudu::MaintenanceManager::RunSchedulerThread()
>   1 kudu::master::CatalogManagerBgTasks::Run()
>   1 kudu::tserver::Heartbeater::Thread::RunThread()
>   1 kudu::tserver::ScannerManager::RunRemovalThread()
>   1 main
>   1 timer_helper_thread
>   1 timer_sigev_thread
>   2 kudu::rpc::AcceptorPool::RunThread()
>   2 master_thread
>   4 kudu::ThreadPool::DispatchThread(bool)
>  12 kudu::rpc::ReactorThread::RunThread()
>  20 kudu::rpc::ServicePool::RunThread()
>3291 kudu::log::Log::AppendThread::RunThread()
>3291 kudu::tablet::TabletPeer::RunLogGC()
> W0626 02:09:16.853266 27997 messenger.cc:187] Unable to handle RPC call: 
> Service unavailable: TSHeartbeat request on kudu.master.MasterService from 
> 127.0.0.1:46937 dropped due to backpressure. The service queue is full; it 
> has 50 items.
> W0626 02:09:16.854862 28074 heartbeater.cc:278] Failed to heartbeat: Remote 
> error: Failed to send heartbeat: Service unavailable: TSHeartbeat request on 
> kudu.master.MasterService from 127.0.0.1:46937 dropped due to backpressure. 
> The service queue is full; it has 50 items.
> W0626 02:09:16.882686 27997 messenger.cc:187] Unable to handle RPC call: 
> Service unavailable: TSHeartbeat request on kudu.master.MasterService from 
> 127.0.0.1:46937 dropped due to backpressure. The service queue is full; it 
> has 50 items.
> W0626 02:09:16.884294 28074 heartbeater.cc:278] Failed to heartbeat: Remote 
> error: Failed to send heartbeat: Service unavailable: TSHeartbeat request on 
> kudu.master.MasterService from 127.0.0.1:46937 dropped due to backpressure. 
> The service queue is full; it has 50 items.
> F0626 02:09:31.747577 10965 test_main.cc:63] Maximum unit test time exceeded 
> (900 sec)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (KUDU-1971) Explore reducing number of data blocks by tuning existing parameters

2017-04-12 Thread Adar Dembo (JIRA)
Adar Dembo created KUDU-1971:


 Summary: Explore reducing number of data blocks by tuning existing 
parameters
 Key: KUDU-1971
 URL: https://issues.apache.org/jira/browse/KUDU-1971
 Project: Kudu
  Issue Type: Sub-task
  Components: tablet
Affects Versions: 1.4.0
Reporter: Adar Dembo


One way to scale to larger on-disk data sets is to reduce the ratio between 
data blocks and data; that is, to make data blocks larger. Two existing 
parameters control this:
* budgeted_compaction_target_rowset_size: within a given flush or compaction 
operation, stipulates the size of each rowset. Currently 32M.
* tablet_compaction_budget_mb: stipulates the amount of data that should be 
included in any given compaction. Currently 128M.

It might be interesting to explore raising these.
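
For illustration, raising these might look like the following tserver flags. 
The flag names are the ones mentioned above, but the values and exact units 
(bytes vs. MB) here are assumptions that should be verified against the flag 
documentation before use:

{noformat}
# Illustrative values only -- not a recommendation.
kudu-tserver \
  --budgeted_compaction_target_rowset_size=67108864 \
  --tablet_compaction_budget_mb=256
{noformat}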



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KUDU-1442) LBM should log startup progress periodically

2017-04-12 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967020#comment-15967020
 ] 

Jean-Daniel Cryans commented on KUDU-1442:
--

bq. That might be tricky to orchestrate across data dir threads; maybe once per 
thread per x number of containers?

Yeah that sounds pretty good. Dunno what x should be though.

> LBM should log startup progress periodically
> 
>
> Key: KUDU-1442
> URL: https://issues.apache.org/jira/browse/KUDU-1442
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tablet
>Reporter: zhangsong
>Priority: Trivial
>  Labels: data-scalability, newbie
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-38) bootstrap should not replay logs that are known to be fully flushed

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-38?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo updated KUDU-38:
---
Issue Type: Sub-task  (was: Improvement)
Parent: KUDU-1967

> bootstrap should not replay logs that are known to be fully flushed
> ---
>
> Key: KUDU-38
> URL: https://issues.apache.org/jira/browse/KUDU-38
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tablet
>Affects Versions: M3
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>  Labels: data-scalability, startup-time
>
> Currently the bootstrap process will process all of the log segments, 
> including those that can be trivially determined to contain only durable 
> edits. This makes startup unnecessarily slow.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (KUDU-1830) Reduce Kudu WAL log disk usage

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo reassigned KUDU-1830:


Assignee: (was: Adar Dembo)

> Reduce Kudu WAL log disk usage
> --
>
> Key: KUDU-1830
> URL: https://issues.apache.org/jira/browse/KUDU-1830
> Project: Kudu
>  Issue Type: Sub-task
>  Components: consensus, log
>Reporter: Juan Yu
>  Labels: data-scalability
>
> The WAL can take significant disk space. There are some configs to limit it, 
> but it can still grow very large.
> WAL size = #tablets * log_segment_size_mb * retained log segments (1 if there 
> are write ops to this tablet; can go up to log_max_segments_to_retain)
> Logs are retained even if there are no writes for a while.
> We could reduce WAL usage by:
> - reducing min_segments_to_retain to 1 instead of 2, and
> - reducing the steady-state consumption of idle tablets: roll a WAL segment if 
> it has had no writes for a few minutes and is more than a MB or two, so that 
> "idle" tablets have 0 WAL space consumed
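
To make the formula above concrete, a purely illustrative back-of-the-envelope 
example (the tablet count and segment settings below are assumptions, not 
measurements from a real cluster):

{noformat}
WAL space ~= #tablets x log_segment_size_mb x retained segments
          ~= 2000    x 8 MB                x 2
          ~= 32 GB, even if most of those tablets are idle
{noformat}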



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (KUDU-383) Massive numbers of threads used by log append and GC

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo reassigned KUDU-383:
---

Assignee: (was: Adar Dembo)

> Massive numbers of threads used by log append and GC
> 
>
> Key: KUDU-383
> URL: https://issues.apache.org/jira/browse/KUDU-383
> Project: Kudu
>  Issue Type: Sub-task
>  Components: log
>Affects Versions: M4.5
>Reporter: Mike Percy
>  Labels: data-scalability
> Attachments: create-table-stress-test-10697.txt.gz
>
>
> {noformat}
> $ ../../build-support/stacktrace-thread-summary.pl 
> create-table-stress-test-10697.txt | awk '{print $3}' | sort | uniq -c | sort 
> -n
>   1 kudu::KernelStackWatchdog::RunThread()
>   1 kudu::MaintenanceManager::RunSchedulerThread()
>   1 kudu::master::CatalogManagerBgTasks::Run()
>   1 kudu::tserver::Heartbeater::Thread::RunThread()
>   1 kudu::tserver::ScannerManager::RunRemovalThread()
>   1 main
>   1 timer_helper_thread
>   1 timer_sigev_thread
>   2 kudu::rpc::AcceptorPool::RunThread()
>   2 master_thread
>   4 kudu::ThreadPool::DispatchThread(bool)
>  12 kudu::rpc::ReactorThread::RunThread()
>  20 kudu::rpc::ServicePool::RunThread()
>3291 kudu::log::Log::AppendThread::RunThread()
>3291 kudu::tablet::TabletPeer::RunLogGC()
> W0626 02:09:16.853266 27997 messenger.cc:187] Unable to handle RPC call: 
> Service unavailable: TSHeartbeat request on kudu.master.MasterService from 
> 127.0.0.1:46937 dropped due to backpressure. The service queue is full; it 
> has 50 items.
> W0626 02:09:16.854862 28074 heartbeater.cc:278] Failed to heartbeat: Remote 
> error: Failed to send heartbeat: Service unavailable: TSHeartbeat request on 
> kudu.master.MasterService from 127.0.0.1:46937 dropped due to backpressure. 
> The service queue is full; it has 50 items.
> W0626 02:09:16.882686 27997 messenger.cc:187] Unable to handle RPC call: 
> Service unavailable: TSHeartbeat request on kudu.master.MasterService from 
> 127.0.0.1:46937 dropped due to backpressure. The service queue is full; it 
> has 50 items.
> W0626 02:09:16.884294 28074 heartbeater.cc:278] Failed to heartbeat: Remote 
> error: Failed to send heartbeat: Service unavailable: TSHeartbeat request on 
> kudu.master.MasterService from 127.0.0.1:46937 dropped due to backpressure. 
> The service queue is full; it has 50 items.
> F0626 02:09:31.747577 10965 test_main.cc:63] Maximum unit test time exceeded 
> (900 sec)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (KUDU-1913) Tablet server runs out of threads when creating lots of tablets

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo reassigned KUDU-1913:


Assignee: (was: Adar Dembo)

> Tablet server runs out of threads when creating lots of tablets
> ---
>
> Key: KUDU-1913
> URL: https://issues.apache.org/jira/browse/KUDU-1913
> Project: Kudu
>  Issue Type: Sub-task
>  Components: consensus, log
>Reporter: Juan Yu
>  Labels: data-scalability
>
> When adding lots of range partitions, all tablet servers crashed with the 
> following error:
> F0308 14:51:04.109369 12952 raft_consensus.cc:1985] Check failed: _s.ok() Bad 
> status: Runtime error: Could not create thread: Resource temporarily 
> unavailable (error 11)
> The tablet server should handle errors/failures more gracefully instead of crashing.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-1549) LBM should start up faster

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo updated KUDU-1549:
-
Issue Type: Sub-task  (was: Improvement)
Parent: KUDU-1967

> LBM should start up faster
> --
>
> Key: KUDU-1549
> URL: https://issues.apache.org/jira/browse/KUDU-1549
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tablet, tserver
> Environment: cpu: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
> mem: 252 G
> disk: single ssd  1.5 T left.
>Reporter: zhangsong
>  Labels: data-scalability
> Attachments: a14844513e5243a993b2b84bf0dcec4c.short.txt
>
>
> After a physical node crash, the recovery/start speed of kudu-tserver was 
> found to be slower than in the usual restart case.  There are messages like 
> "Found partial trailing metadata" in the kudu-tserver log, and it seems to 
> cost more than 20 minutes to recover this metadata.
> According to adar, it should be this slow.
> The attachment is the startup log.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-1549) LBM should start up faster

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo updated KUDU-1549:
-
Summary: LBM should start up faster  (was: recovery speed of kudu-tserver 
should be faster.)

I'm repurposing this JIRA for the general problem of "LBM startup is too damn 
slow."

Some potential improvements:
# Identify and delete LBM containers that are full but have no live blocks. 
This can happen at startup time, at last-live-block-deletion time, periodically 
(perhaps via maintenance manager scheduling), or some combination of the above.
# Identify LBM containers that are full and have very few live blocks. 
"Defragment" the container and make it available for writing again. Probably 
best to do this periodically; it may get expensive to do it at startup or when 
the container becomes full.
# Compact LBM container metadata by identifying and removing CREATE/DELETE 
pairs of records. Probably best to restrict this to full containers. Not sure 
when it's best to do it.
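
A rough sketch of idea 3 (my own illustration, not the real LBM metadata format 
or code): scan the record stream once to find deleted block IDs, then keep only 
the CREATE records for blocks that are still live.

{code}
#include <cstdint>
#include <iostream>
#include <unordered_set>
#include <vector>

struct Record {
  enum Type { CREATE, REMOVE } type;  // REMOVE stands in for a DELETE record
  int64_t block_id;
};

// Drop matched CREATE/DELETE pairs; they contribute nothing to replay.
std::vector<Record> CompactMetadata(const std::vector<Record>& records) {
  std::unordered_set<int64_t> deleted;
  for (const auto& r : records) {
    if (r.type == Record::REMOVE) deleted.insert(r.block_id);
  }
  std::vector<Record> compacted;
  for (const auto& r : records) {
    if (r.type == Record::CREATE && deleted.count(r.block_id) == 0) {
      compacted.push_back(r);
    }
  }
  return compacted;
}

int main() {
  std::vector<Record> records = {{Record::CREATE, 1}, {Record::CREATE, 2},
                                 {Record::REMOVE, 1}, {Record::CREATE, 3},
                                 {Record::REMOVE, 3}};
  // Only block 2 survives, so the rewritten metadata has a single record.
  std::cout << CompactMetadata(records).size() << " live record(s)\n";
  return 0;
}
{code}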

> LBM should start up faster
> --
>
> Key: KUDU-1549
> URL: https://issues.apache.org/jira/browse/KUDU-1549
> Project: Kudu
>  Issue Type: Improvement
>  Components: tablet, tserver
> Environment: cpu: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
> mem: 252 G
> disk: single ssd  1.5 T left.
>Reporter: zhangsong
>  Labels: data-scalability
> Attachments: a14844513e5243a993b2b84bf0dcec4c.short.txt
>
>
> After a physical node crash, the recovery/start speed of kudu-tserver was 
> found to be slower than in the usual restart case.  There are messages like 
> "Found partial trailing metadata" in the kudu-tserver log, and it seems to 
> cost more than 20 minutes to recover this metadata.
> According to adar, it should be this slow.
> The attachment is the startup log.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-1442) LBM should log startup progress periodically

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo updated KUDU-1442:
-
 Labels: data-scalability newbie  (was: data-scalability)
Summary: LBM should log startup progress periodically  (was: kudu should 
have more log when starting up, user need to check the progress of starting.)

> LBM should log startup progress periodically
> 
>
> Key: KUDU-1442
> URL: https://issues.apache.org/jira/browse/KUDU-1442
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tablet
>Reporter: zhangsong
>Priority: Trivial
>  Labels: data-scalability, newbie
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KUDU-1442) kudu should have more log when starting up, user need to check the progress of starting.

2017-04-12 Thread Adar Dembo (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967012#comment-15967012
 ] 

Adar Dembo commented on KUDU-1442:
--

Apart from fixing KUDU-1192, we should (as [~bruceSz] said) log more progress 
information in the log block manager.

Perhaps a periodic progress message every second that describes how many 
containers we've loaded thus far? That might be tricky to orchestrate across 
data dir threads; maybe once per thread per x number of containers?
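For example, each data-dir thread could do something along these lines (a toy 
standalone sketch; the container count and the choice of "x" are placeholders, 
not real Kudu values or flags):

{code}
#include <iostream>

int main() {
  const int kContainersInThisDir = 10000;  // pretend workload
  const int kLogEvery = 1000;              // the "x" above; value is a guess
  for (int i = 1; i <= kContainersInThisDir; i++) {
    // ... open and verify container i here ...
    if (i % kLogEvery == 0) {
      std::cout << "data dir thread: loaded " << i << "/"
                << kContainersInThisDir << " block containers" << std::endl;
    }
  }
  return 0;
}
{code}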


> kudu should have more log when starting up, user need to check the progress 
> of starting.
> 
>
> Key: KUDU-1442
> URL: https://issues.apache.org/jira/browse/KUDU-1442
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tablet
>Reporter: zhangsong
>Priority: Trivial
>  Labels: data-scalability
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-1442) kudu should have more log when starting up, user need to check the progress of starting.

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo updated KUDU-1442:
-
Issue Type: Sub-task  (was: Improvement)
Parent: KUDU-1967

> kudu should have more log when starting up, user need to check the progress 
> of starting.
> 
>
> Key: KUDU-1442
> URL: https://issues.apache.org/jira/browse/KUDU-1442
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tablet
>Reporter: zhangsong
>Priority: Trivial
>  Labels: data-scalability
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-1125) Reduce impact of enabling fsync on the master

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo updated KUDU-1125:
-
Issue Type: Sub-task  (was: Improvement)
Parent: KUDU-1967

> Reduce impact of enabling fsync on the master
> -
>
> Key: KUDU-1125
> URL: https://issues.apache.org/jira/browse/KUDU-1125
> Project: Kudu
>  Issue Type: Sub-task
>  Components: master
>Affects Versions: Feature Complete
>Reporter: Jean-Daniel Cryans
>Priority: Critical
>  Labels: data-scalability
>
> First time running ITBLL since we enabled fsync in the master, and I'm now 
> seeing RPCs time out because the master is always returning 
> ERROR_SERVER_TOO_BUSY. In the log I can see a lot of elections going on and 
> the queue is always full.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (KUDU-1830) Reduce Kudu WAL log disk usage

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo reassigned KUDU-1830:


Assignee: Adar Dembo

> Reduce Kudu WAL log disk usage
> --
>
> Key: KUDU-1830
> URL: https://issues.apache.org/jira/browse/KUDU-1830
> Project: Kudu
>  Issue Type: Sub-task
>  Components: consensus, log
>Reporter: Juan Yu
>Assignee: Adar Dembo
>  Labels: data-scalability
>
> The WAL can take significant disk space. There are some configs to limit it, 
> but it can still grow very large.
> WAL size = #tablets * log_segment_size_mb * retained log segments (1 if there 
> are write ops to this tablet; can go up to log_max_segments_to_retain)
> Logs are retained even if there are no writes for a while.
> We could reduce WAL usage by:
> - reducing min_segments_to_retain to 1 instead of 2, and
> - reducing the steady-state consumption of idle tablets: roll a WAL segment if 
> it has had no writes for a few minutes and is more than a MB or two, so that 
> "idle" tablets have 0 WAL space consumed



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-1807) GetTableSchema() is O(n) in the number of tablets

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo updated KUDU-1807:
-
Issue Type: Sub-task  (was: Bug)
Parent: KUDU-1967

> GetTableSchema() is O(n) in the number of tablets
> -
>
> Key: KUDU-1807
> URL: https://issues.apache.org/jira/browse/KUDU-1807
> Project: Kudu
>  Issue Type: Sub-task
>  Components: master, perf
>Affects Versions: 1.2.0
>Reporter: Todd Lipcon
>Priority: Critical
>  Labels: data-scalability
>
> GetTableSchema calls TableInfo::IsCreateTableDone. This method checks each 
> tablet for whether it is in the correct state, which requires acquiring the 
> RWC lock for every tablet. This is somewhat slow for large tables with 
> thousands of tablets, and this is actually a relatively hot path because 
> every task in an Impala query ends up calling GetTableSchema() when it opens 
> its scanner.
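
One possible shape of a fix, as a simplified sketch of my own (not the actual 
CatalogManager/TableInfo code; TableState and its fields are invented): cache 
the fact that table creation has completed, since it can never become un-done, 
so the per-tablet O(n) check runs at most until the first success.

{code}
#include <atomic>
#include <utility>
#include <vector>

struct TabletState {
  bool running;  // stand-in for "tablet has reached the RUNNING state"
};

class TableState {  // hypothetical, heavily simplified
 public:
  explicit TableState(std::vector<TabletState> tablets)
      : tablets_(std::move(tablets)) {}

  bool IsCreateTableDone() {
    // Fast path: once creation is done it can never become "not done" again.
    if (create_done_.load(std::memory_order_acquire)) return true;
    for (const auto& t : tablets_) {  // O(n) slow path, taken rarely
      if (!t.running) return false;
    }
    create_done_.store(true, std::memory_order_release);
    return true;
  }

 private:
  std::vector<TabletState> tablets_;
  std::atomic<bool> create_done_{false};
};

int main() {
  TableState table({{true}, {true}, {true}});
  return table.IsCreateTableDone() ? 0 : 1;  // exits 0 once all are running
}
{code}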



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (KUDU-1913) Tablet server runs out of threads when creating lots of tablets

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo reassigned KUDU-1913:


Assignee: Adar Dembo

> Tablet server runs out of threads when creating lots of tablets
> ---
>
> Key: KUDU-1913
> URL: https://issues.apache.org/jira/browse/KUDU-1913
> Project: Kudu
>  Issue Type: Sub-task
>  Components: consensus, log
>Reporter: Juan Yu
>Assignee: Adar Dembo
>  Labels: data-scalability
>
> When adding lots of range partitions, all tablet servers crashed with the 
> following error:
> F0308 14:51:04.109369 12952 raft_consensus.cc:1985] Check failed: _s.ok() Bad 
> status: Runtime error: Could not create thread: Resource temporarily 
> unavailable (error 11)
> The tablet server should handle errors/failures more gracefully instead of crashing.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-1830) Reduce Kudu WAL log disk usage

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo updated KUDU-1830:
-
Issue Type: Sub-task  (was: Improvement)
Parent: KUDU-1967

> Reduce Kudu WAL log disk usage
> --
>
> Key: KUDU-1830
> URL: https://issues.apache.org/jira/browse/KUDU-1830
> Project: Kudu
>  Issue Type: Sub-task
>  Components: consensus, log
>Reporter: Juan Yu
>  Labels: data-scalability
>
> The WAL can take significant disk space. There are some configs to limit it, 
> but it can still grow very large.
> WAL size = #tablets * log_segment_size_mb * retained log segments (1 if there 
> are write ops to this tablet; can go up to log_max_segments_to_retain)
> Logs are retained even if there are no writes for a while.
> We could reduce WAL usage by:
> - reducing min_segments_to_retain to 1 instead of 2, and
> - reducing the steady-state consumption of idle tablets: roll a WAL segment if 
> it has had no writes for a few minutes and is more than a MB or two, so that 
> "idle" tablets have 0 WAL space consumed



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-383) Massive numbers of threads used by log append and GC

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo updated KUDU-383:

Issue Type: Sub-task  (was: Bug)
Parent: KUDU-1967

> Massive numbers of threads used by log append and GC
> 
>
> Key: KUDU-383
> URL: https://issues.apache.org/jira/browse/KUDU-383
> Project: Kudu
>  Issue Type: Sub-task
>  Components: log
>Affects Versions: M4.5
>Reporter: Mike Percy
>Assignee: Adar Dembo
>  Labels: data-scalability
> Attachments: create-table-stress-test-10697.txt.gz
>
>
> {noformat}
> $ ../../build-support/stacktrace-thread-summary.pl 
> create-table-stress-test-10697.txt | awk '{print $3}' | sort | uniq -c | sort 
> -n
>   1 kudu::KernelStackWatchdog::RunThread()
>   1 kudu::MaintenanceManager::RunSchedulerThread()
>   1 kudu::master::CatalogManagerBgTasks::Run()
>   1 kudu::tserver::Heartbeater::Thread::RunThread()
>   1 kudu::tserver::ScannerManager::RunRemovalThread()
>   1 main
>   1 timer_helper_thread
>   1 timer_sigev_thread
>   2 kudu::rpc::AcceptorPool::RunThread()
>   2 master_thread
>   4 kudu::ThreadPool::DispatchThread(bool)
>  12 kudu::rpc::ReactorThread::RunThread()
>  20 kudu::rpc::ServicePool::RunThread()
>3291 kudu::log::Log::AppendThread::RunThread()
>3291 kudu::tablet::TabletPeer::RunLogGC()
> W0626 02:09:16.853266 27997 messenger.cc:187] Unable to handle RPC call: 
> Service unavailable: TSHeartbeat request on kudu.master.MasterService from 
> 127.0.0.1:46937 dropped due to backpressure. The service queue is full; it 
> has 50 items.
> W0626 02:09:16.854862 28074 heartbeater.cc:278] Failed to heartbeat: Remote 
> error: Failed to send heartbeat: Service unavailable: TSHeartbeat request on 
> kudu.master.MasterService from 127.0.0.1:46937 dropped due to backpressure. 
> The service queue is full; it has 50 items.
> W0626 02:09:16.882686 27997 messenger.cc:187] Unable to handle RPC call: 
> Service unavailable: TSHeartbeat request on kudu.master.MasterService from 
> 127.0.0.1:46937 dropped due to backpressure. The service queue is full; it 
> has 50 items.
> W0626 02:09:16.884294 28074 heartbeater.cc:278] Failed to heartbeat: Remote 
> error: Failed to send heartbeat: Service unavailable: TSHeartbeat request on 
> kudu.master.MasterService from 127.0.0.1:46937 dropped due to backpressure. 
> The service queue is full; it has 50 items.
> F0626 02:09:31.747577 10965 test_main.cc:63] Maximum unit test time exceeded 
> (900 sec)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-383) Massive numbers of threads used by log append and GC

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo updated KUDU-383:

Labels: data-scalability  (was: )

> Massive numbers of threads used by log append and GC
> 
>
> Key: KUDU-383
> URL: https://issues.apache.org/jira/browse/KUDU-383
> Project: Kudu
>  Issue Type: Bug
>  Components: log
>Affects Versions: M4.5
>Reporter: Mike Percy
>Assignee: Mike Percy
>  Labels: data-scalability
> Attachments: create-table-stress-test-10697.txt.gz
>
>
> {noformat}
> $ ../../build-support/stacktrace-thread-summary.pl 
> create-table-stress-test-10697.txt | awk '{print $3}' | sort | uniq -c | sort 
> -n
>   1 kudu::KernelStackWatchdog::RunThread()
>   1 kudu::MaintenanceManager::RunSchedulerThread()
>   1 kudu::master::CatalogManagerBgTasks::Run()
>   1 kudu::tserver::Heartbeater::Thread::RunThread()
>   1 kudu::tserver::ScannerManager::RunRemovalThread()
>   1 main
>   1 timer_helper_thread
>   1 timer_sigev_thread
>   2 kudu::rpc::AcceptorPool::RunThread()
>   2 master_thread
>   4 kudu::ThreadPool::DispatchThread(bool)
>  12 kudu::rpc::ReactorThread::RunThread()
>  20 kudu::rpc::ServicePool::RunThread()
>3291 kudu::log::Log::AppendThread::RunThread()
>3291 kudu::tablet::TabletPeer::RunLogGC()
> W0626 02:09:16.853266 27997 messenger.cc:187] Unable to handle RPC call: 
> Service unavailable: TSHeartbeat request on kudu.master.MasterService from 
> 127.0.0.1:46937 dropped due to backpressure. The service queue is full; it 
> has 50 items.
> W0626 02:09:16.854862 28074 heartbeater.cc:278] Failed to heartbeat: Remote 
> error: Failed to send heartbeat: Service unavailable: TSHeartbeat request on 
> kudu.master.MasterService from 127.0.0.1:46937 dropped due to backpressure. 
> The service queue is full; it has 50 items.
> W0626 02:09:16.882686 27997 messenger.cc:187] Unable to handle RPC call: 
> Service unavailable: TSHeartbeat request on kudu.master.MasterService from 
> 127.0.0.1:46937 dropped due to backpressure. The service queue is full; it 
> has 50 items.
> W0626 02:09:16.884294 28074 heartbeater.cc:278] Failed to heartbeat: Remote 
> error: Failed to send heartbeat: Service unavailable: TSHeartbeat request on 
> kudu.master.MasterService from 127.0.0.1:46937 dropped due to backpressure. 
> The service queue is full; it has 50 items.
> F0626 02:09:31.747577 10965 test_main.cc:63] Maximum unit test time exceeded 
> (900 sec)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (KUDU-383) Massive numbers of threads used by log append and GC

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo reassigned KUDU-383:
---

Assignee: Adar Dembo  (was: Mike Percy)

> Massive numbers of threads used by log append and GC
> 
>
> Key: KUDU-383
> URL: https://issues.apache.org/jira/browse/KUDU-383
> Project: Kudu
>  Issue Type: Bug
>  Components: log
>Affects Versions: M4.5
>Reporter: Mike Percy
>Assignee: Adar Dembo
>  Labels: data-scalability
> Attachments: create-table-stress-test-10697.txt.gz
>
>
> {noformat}
> $ ../../build-support/stacktrace-thread-summary.pl 
> create-table-stress-test-10697.txt | awk '{print $3}' | sort | uniq -c | sort 
> -n
>   1 kudu::KernelStackWatchdog::RunThread()
>   1 kudu::MaintenanceManager::RunSchedulerThread()
>   1 kudu::master::CatalogManagerBgTasks::Run()
>   1 kudu::tserver::Heartbeater::Thread::RunThread()
>   1 kudu::tserver::ScannerManager::RunRemovalThread()
>   1 main
>   1 timer_helper_thread
>   1 timer_sigev_thread
>   2 kudu::rpc::AcceptorPool::RunThread()
>   2 master_thread
>   4 kudu::ThreadPool::DispatchThread(bool)
>  12 kudu::rpc::ReactorThread::RunThread()
>  20 kudu::rpc::ServicePool::RunThread()
>3291 kudu::log::Log::AppendThread::RunThread()
>3291 kudu::tablet::TabletPeer::RunLogGC()
> W0626 02:09:16.853266 27997 messenger.cc:187] Unable to handle RPC call: 
> Service unavailable: TSHeartbeat request on kudu.master.MasterService from 
> 127.0.0.1:46937 dropped due to backpressure. The service queue is full; it 
> has 50 items.
> W0626 02:09:16.854862 28074 heartbeater.cc:278] Failed to heartbeat: Remote 
> error: Failed to send heartbeat: Service unavailable: TSHeartbeat request on 
> kudu.master.MasterService from 127.0.0.1:46937 dropped due to backpressure. 
> The service queue is full; it has 50 items.
> W0626 02:09:16.882686 27997 messenger.cc:187] Unable to handle RPC call: 
> Service unavailable: TSHeartbeat request on kudu.master.MasterService from 
> 127.0.0.1:46937 dropped due to backpressure. The service queue is full; it 
> has 50 items.
> W0626 02:09:16.884294 28074 heartbeater.cc:278] Failed to heartbeat: Remote 
> error: Failed to send heartbeat: Service unavailable: TSHeartbeat request on 
> kudu.master.MasterService from 127.0.0.1:46937 dropped due to backpressure. 
> The service queue is full; it has 50 items.
> F0626 02:09:31.747577 10965 test_main.cc:63] Maximum unit test time exceeded 
> (900 sec)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-1913) Tablet server runs out of threads when creating lots of tablets

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo updated KUDU-1913:
-
Issue Type: Sub-task  (was: Bug)
Parent: KUDU-1967

> Tablet server runs out of threads when creating lots of tablets
> ---
>
> Key: KUDU-1913
> URL: https://issues.apache.org/jira/browse/KUDU-1913
> Project: Kudu
>  Issue Type: Sub-task
>  Components: consensus, log
>Reporter: Juan Yu
>  Labels: data-scalability
>
> When adding lots of range partitions, all tablet servers crashed with the 
> following error:
> F0308 14:51:04.109369 12952 raft_consensus.cc:1985] Check failed: _s.ok() Bad 
> status: Runtime error: Could not create thread: Resource temporarily 
> unavailable (error 11)
> The tablet server should handle errors/failures more gracefully instead of crashing.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (KUDU-1970) Integration test for data scalability

2017-04-12 Thread Adar Dembo (JIRA)
Adar Dembo created KUDU-1970:


 Summary: Integration test for data scalability
 Key: KUDU-1970
 URL: https://issues.apache.org/jira/browse/KUDU-1970
 Project: Kudu
  Issue Type: Sub-task
  Components: master, tserver
Affects Versions: 1.4.0
Reporter: Adar Dembo
Assignee: Adar Dembo


To help test data scalability fixes, we need a way to easily produce an 
environment that exhibits our current scalability issues. I'm sure one of our 
long-running workloads would be up to the task, but aside from taking a long 
time, it'd also fill up the disk, which makes it unusable on most developer 
machines. Ultimately, data isn't really the root cause of our scalability woes; 
it's the metadata necessary to maintain the data that hurts us. So an idealized 
environment would be heavy on the metadata. Here's a not-so-exhaustive list:
* Many tablets.
* Many columns per tablet.
* Many rowsets per tablet.
* Many data blocks.
* Many tables (tservers don't care about this, but maybe the master does?)

Let's write an integration test that swamps the machine with the above. It 
should use an external mini cluster to simplify isolating master and tserver 
performance characteristics, but it needn't have more than one instance of each.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (KUDU-1837) kudu-jepsen reports non-linearizable history for the tserver-majorities-ring scenario

2017-04-12 Thread Alexey Serbin (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966918#comment-15966918
 ] 

Alexey Serbin edited comment on KUDU-1837 at 4/13/17 12:34 AM:
---

This bug is not valid because kudu-jepsen ran with the wrong client model: a 
separate Kudu client per test actor.  That model would require propagating 
timestamps between clients for sound test results, but that was not done.

The client model has been updated by changelist 
678a309b5d88e2fe9c6a0674ba7de00daee35cac: since then, all test actors share the 
same Kudu client instance, so the timestamp is propagated automatically.

Since the client model was updated, no errors have occurred.  If any new issue 
is found by the new kudu-jepsen code, please open a new JIRA item for it.


was (Author: aserbin):
This bug is not valid because kudu-jepsen ran with the wrong client model: a 
separate Kudu client per test actor.  That model would require propagating 
timestamps between clients for sound test results, but that was not done.

The client model has been updated by changelist 
678a309b5d88e2fe9c6a0674ba7de00daee35cac: since then, all test actors share the 
same Kudu client instance, so the timestamp is propagated automatically.

> kudu-jepsen reports non-linearizable history for the tserver-majorities-ring 
> scenario
> -
>
> Key: KUDU-1837
> URL: https://issues.apache.org/jira/browse/KUDU-1837
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Alexey Serbin
>  Labels: kudu-jepsen
> Fix For: 1.4.0
>
> Attachments: 20170115T142504.000-0800.tar.bz2
>
>
> The kudu-jepsen test has found an instance of non-linearizable history for 
> the tserver-majorities-ring scenario.  The artifacts from the failed scenario 
> are attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (KUDU-1825) kudu-jepsen reports non-linearizable history for the kill-restart-all-tservers scenario

2017-04-12 Thread Alexey Serbin (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966917#comment-15966917
 ] 

Alexey Serbin edited comment on KUDU-1825 at 4/13/17 12:33 AM:
---

This bug is not valid because kudu-jepsen ran with the wrong client model: a 
separate Kudu client per test actor.  That model would require propagating 
timestamps between clients for sound test results, but that was not done.

The client model has been updated by changelist 
678a309b5d88e2fe9c6a0674ba7de00daee35cac: since then, all test actors share the 
same Kudu client instance, so the timestamp is propagated automatically.

Since the client model was updated, no errors have occurred.  If any new issue 
is found by the new kudu-jepsen code, please open a new JIRA item for it.


was (Author: aserbin):
This bug is not valid because kudu-jepsen ran with the wrong client model: a 
separate Kudu client per test actor.  That model would require propagating 
timestamps between clients for sound test results, but that was not done.

The client model has been updated by changelist 
678a309b5d88e2fe9c6a0674ba7de00daee35cac: since then, all test actors share the 
same Kudu client instance, so the timestamp is propagated automatically.

> kudu-jepsen reports non-linearizable history for the 
> kill-restart-all-tservers scenario
> ---
>
> Key: KUDU-1825
> URL: https://issues.apache.org/jira/browse/KUDU-1825
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>  Labels: kudu-jepsen
> Fix For: 1.4.0
>
> Attachments: history.txt, linear.svg, master.log, ts-1.log, ts-2.log, 
> ts-3.log, ts-4.log, ts-5.log
>
>
> The kudu-jepsen test found a non-linearizable history of operations for the 
> kill-restart-all-tservers scenario.
> The artifacts of the failed scenario are attached.
> It's necessary to create reproducible scenarios for this and fix it, if possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (KUDU-1842) kudu-jepsen reports non-linearizable history for the 'all-random-halves' scenario

2017-04-12 Thread Alexey Serbin (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966919#comment-15966919
 ] 

Alexey Serbin edited comment on KUDU-1842 at 4/13/17 12:33 AM:
---

This bug is not valid because kudu-jepsen ran with the wrong client model: a 
separate Kudu client per test actor.  That model would require propagating 
timestamps between clients for sound test results, but that was not done.

The client model has been updated by changelist 
678a309b5d88e2fe9c6a0674ba7de00daee35cac: since then, all test actors share the 
same Kudu client instance, so the timestamp is propagated automatically.

Since the client model was updated, no errors have occurred.  If any new issue 
is found by the new kudu-jepsen code, please open a new JIRA item for it.


was (Author: aserbin):
This bug is not valid because kudu-jepsen ran with the wrong client model: a 
separate Kudu client per test actor.  That model would require propagating 
timestamps between clients for sound test results, but that was not done.

The client model has been updated by changelist 
678a309b5d88e2fe9c6a0674ba7de00daee35cac: since then, all test actors share the 
same Kudu client instance, so the timestamp is propagated automatically.

> kudu-jepsen reports non-linearizable history for the 'all-random-halves' 
> scenario
> -
>
> Key: KUDU-1842
> URL: https://issues.apache.org/jira/browse/KUDU-1842
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, test
>Affects Versions: 1.2.0
>Reporter: Alexey Serbin
>  Labels: kudu-jepsen
> Fix For: 1.4.0
>
> Attachments: 20170119T023411.000-0800.tar.bz2
>
>
> The kudu-jepsen test has found an instance of non-linearizable history for 
> the  'all-random-halves' nemesis scenario. The artifacts from the failed 
> scenario are attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (KUDU-1838) kudu-jepsen reports non-linearizable history for the hammer-3-tservers scenario

2017-04-12 Thread Alexey Serbin (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966914#comment-15966914
 ] 

Alexey Serbin edited comment on KUDU-1838 at 4/13/17 12:32 AM:
---

This bug is not valid because kudu-jepsen ran with the wrong client model: a 
separate Kudu client per test actor.  That model would require propagating 
timestamps between clients for sound test results, but that was not done.

The client model has been updated by changelist 
678a309b5d88e2fe9c6a0674ba7de00daee35cac: since then, all test actors share the 
same Kudu client instance, so the timestamp is propagated automatically.

Since the client model was updated, no errors have occurred.  If any new issue 
is found by the new kudu-jepsen code, please open a new JIRA item for it.


was (Author: aserbin):
This bug is not valid because kudu-jepsen ran with the wrong client model: a 
separate Kudu client per test actor.  That model would require propagating 
timestamps between clients for sound test results, but that was not done.

The client model has been updated by changelist 
678a309b5d88e2fe9c6a0674ba7de00daee35cac: since then, all test actors share the 
same Kudu client instance, so the timestamp is propagated automatically.

> kudu-jepsen reports non-linearizable history for the hammer-3-tservers 
> scenario
> ---
>
> Key: KUDU-1838
> URL: https://issues.apache.org/jira/browse/KUDU-1838
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Alexey Serbin
>  Labels: kudu-jepsen
> Fix For: 1.4.0
>
> Attachments: 20170115T165415.000-0800.tar.bz2
>
>
> The kudu-jepsen test has found an instance of non-linearizable history for 
> the hammer-3-tservers scenario. The artifacts from the failed scenario are 
> attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KUDU-1838) kudu-jepsen reports non-linearizable history for the hammer-3-tservers scenario

2017-04-12 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966921#comment-15966921
 ] 

David Alves commented on KUDU-1838:
---

I should add that since that was fixed, this error no longer occurs.

> kudu-jepsen reports non-linearizable history for the hammer-3-tservers 
> scenario
> ---
>
> Key: KUDU-1838
> URL: https://issues.apache.org/jira/browse/KUDU-1838
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Alexey Serbin
>  Labels: kudu-jepsen
> Fix For: 1.4.0
>
> Attachments: 20170115T165415.000-0800.tar.bz2
>
>
> The kudu-jepsen test has found an instance of non-linearizable history for 
> the hammer-3-tservers scenario. The artifacts from the failed scenario are 
> attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (KUDU-1837) kudu-jepsen reports non-linearizable history for the tserver-majorities-ring scenario

2017-04-12 Thread Alexey Serbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-1837.
-
   Resolution: Invalid
Fix Version/s: 1.4.0

This bug is not valid because kudu-jepsen ran with the wrong client model: a 
separate Kudu client per test actor.  That model would require propagating 
timestamps between clients for sound test results, but that was not done.

The client model has been updated by changelist 
678a309b5d88e2fe9c6a0674ba7de00daee35cac: since then, all test actors share the 
same Kudu client instance, so the timestamp is propagated automatically.

> kudu-jepsen reports non-linearizable history for the tserver-majorities-ring 
> scenario
> -
>
> Key: KUDU-1837
> URL: https://issues.apache.org/jira/browse/KUDU-1837
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Alexey Serbin
>  Labels: kudu-jepsen
> Fix For: 1.4.0
>
> Attachments: 20170115T142504.000-0800.tar.bz2
>
>
> The kudu-jepsen test has found an instance of non-linearizable history for 
> the tserver-majorities-ring scenario.  The artifacts from the failed scenario 
> are attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (KUDU-1842) kudu-jepsen reports non-linearizable history for the 'all-random-halves' scenario

2017-04-12 Thread Alexey Serbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-1842.
-
   Resolution: Invalid
Fix Version/s: 1.4.0

This bug is not valid because kudu-jepsen ran with the wrong client model: a 
separate Kudu client per test actor.  That model would require propagating 
timestamps between clients for sound test results, but that was not done.

The client model has been updated by changelist 
678a309b5d88e2fe9c6a0674ba7de00daee35cac: since then, all test actors share the 
same Kudu client instance, so the timestamp is propagated automatically.

> kudu-jepsen reports non-linearizable history for the 'all-random-halves' 
> scenario
> -
>
> Key: KUDU-1842
> URL: https://issues.apache.org/jira/browse/KUDU-1842
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, test
>Affects Versions: 1.2.0
>Reporter: Alexey Serbin
>  Labels: kudu-jepsen
> Fix For: 1.4.0
>
> Attachments: 20170119T023411.000-0800.tar.bz2
>
>
> The kudu-jepsen test has found an instance of non-linearizable history for 
> the  'all-random-halves' nemesis scenario. The artifacts from the failed 
> scenario are attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (KUDU-1825) kudu-jepsen reports non-linearizable history for the kill-restart-all-tservers scenario

2017-04-12 Thread Alexey Serbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-1825.
-
   Resolution: Invalid
Fix Version/s: 1.4.0

This bug is not valid because kudu-jepsen ran with the wrong client model: a 
separate Kudu client per test actor. That model would require propagating 
timestamps between clients to get sound test results, but that was not done.

The client model has been updated by changelist 
678a309b5d88e2fe9c6a0674ba7de00daee35cac: since then, all test actors share the 
same Kudu client instance, so timestamps are propagated automatically.

> kudu-jepsen reports non-linearizable history for the 
> kill-restart-all-tservers scenario
> ---
>
> Key: KUDU-1825
> URL: https://issues.apache.org/jira/browse/KUDU-1825
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>  Labels: kudu-jepsen
> Fix For: 1.4.0
>
> Attachments: history.txt, linear.svg, master.log, ts-1.log, ts-2.log, 
> ts-3.log, ts-4.log, ts-5.log
>
>
> The kudu-jepsen test found a non-linearizable history of operations for the 
> kill-restart-all-tservers scenario.
> The artifacts of the failed scenario are attached.
> It's necessary to create a reproducible scenario for this and fix it, if possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (KUDU-1838) kudu-jepsen reports non-linearizable history for the hammer-3-tservers scenario

2017-04-12 Thread Alexey Serbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-1838.
-
   Resolution: Invalid
Fix Version/s: 1.4.0

This bug is not valid because kudu-jepsen ran with the wrong client model: a 
separate Kudu client per test actor. That model would require propagating 
timestamps between clients to get sound test results, but that was not done.

The client model has been updated by changelist 
678a309b5d88e2fe9c6a0674ba7de00daee35cac: since then, all test actors share the 
same Kudu client instance, so timestamps are propagated automatically.

> kudu-jepsen reports non-linearizable history for the hammer-3-tservers 
> scenario
> ---
>
> Key: KUDU-1838
> URL: https://issues.apache.org/jira/browse/KUDU-1838
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Alexey Serbin
>  Labels: kudu-jepsen
> Fix For: 1.4.0
>
> Attachments: 20170115T165415.000-0800.tar.bz2
>
>
> The kudu-jepsen test has found an instance of non-linearizable history for 
> the hammer-3-tservers scenario. The artifacts from the failed scenario are 
> attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (KUDU-1966) Data directories can be removed erroneously

2017-04-12 Thread Dan Burkert (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Burkert resolved KUDU-1966.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

> Data directories can be removed erroneously
> ---
>
> Key: KUDU-1966
> URL: https://issues.apache.org/jira/browse/KUDU-1966
> Project: Kudu
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Assignee: Dan Burkert
> Fix For: 1.4.0
>
>
> Kudu data directories can be removed in between starts of a server, which 
> will lead to tablet bootstrap failures. There exists logic to protect against 
> this (see 
> [PathInstanceMetadataFile::CheckIntegrity()|https://github.com/apache/kudu/blob/master/src/kudu/fs/block_manager_util.h#L78]),
>  but it was missing a key check to ensure that no directory had been removed 
> from the set.
> To be clear, we do want to support removing data directories, but in a more 
> structured and protected manner. For the time being, we should close this 
> loophole.
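For reference, a minimal Python sketch of the missing check described above 
(the real implementation lives in the C++ block manager utilities; the names 
here are illustrative only):

{code}
def check_no_data_dirs_removed(recorded_dirs, configured_dirs):
    """Fail if a directory recorded in the path instance metadata is gone."""
    missing = set(recorded_dirs) - set(configured_dirs)
    if missing:
        raise RuntimeError(
            "data directories removed since last start: %s" % sorted(missing))

# Example: /data/2 was dropped from --fs_data_dirs between server starts.
check_no_data_dirs_removed(
    recorded_dirs=["/data/1", "/data/2", "/data/3"],
    configured_dirs=["/data/1", "/data/3"])
{code}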



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KUDU-1966) Data directories can be removed erroneously

2017-04-12 Thread Dan Burkert (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966899#comment-15966899
 ] 

Dan Burkert commented on KUDU-1966:
---

Resolved by 
[bd24f04fb43db9f1fcbf9a60ecc31824c3c79bfd|https://github.com/apache/kudu/commit/bd24f04fb43db9f1fcbf9a60ecc31824c3c79bfd].

> Data directories can be removed erroneously
> ---
>
> Key: KUDU-1966
> URL: https://issues.apache.org/jira/browse/KUDU-1966
> Project: Kudu
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Assignee: Dan Burkert
>
> Kudu data directories can be removed in between starts of a server, which 
> will lead to tablet bootstrap failures. There exists logic to protect against 
> this (see 
> [PathInstanceMetadataFile::CheckIntegrity()|https://github.com/apache/kudu/blob/master/src/kudu/fs/block_manager_util.h#L78]),
>  but it was missing a key check to ensure that no directory had been removed 
> from the set.
> To be clear, we do want to support removing data directories, but in a more 
> structured and protected manner. For the time being, we should close this 
> loophole.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (KUDU-1968) Aborted tablet copies delete live blocks

2017-04-12 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-1968.
---
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1

Resolved by reverting the above-mentioned patch. Will fast-track a 1.3.1 
release.

> Aborted tablet copies delete live blocks
> 
>
> Key: KUDU-1968
> URL: https://issues.apache.org/jira/browse/KUDU-1968
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.3.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Blocker
> Fix For: 1.3.1, 1.4.0
>
>
> 72541b47eb55b2df4eab5d6050f517476ed6d370 (KUDU-1853) caused a serious 
> regression in the case of a failed tablet copy. As of that patch, the 
> following sequence happens:
> - we fetch the remote tablet's metadata, and set our local metadata to match 
> it (including the remote block IDs)
> - as we download blocks, we replace remote block ids with local block IDs
> - if we fail in the middle, we call DeleteTablet
> -- this means that, since we still have some remote block IDs in the 
> metadata, the DeleteTablet call deletes local blocks based on remote block 
> IDs. These block ids are likely to belong to other live tablets locally!
> This can cause pretty serious data loss, and has the tendency to cascade 
> around a cluster, since later attempts to copy a tablet with missing blocks 
> will get aborted as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Reopened] (KUDU-1853) Error during tablet copy may orphan a bunch of stuff

2017-04-12 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo reopened KUDU-1853:
--

With the revert of 72541b47eb55b2df4eab5d6050f517476ed6d370 (see KUDU-1968), 
this bug is no longer fixed. It remains fixed in 1.3.0, which has already been 
released, but not in any subsequent release.

> Error during tablet copy may orphan a bunch of stuff
> 
>
> Key: KUDU-1853
> URL: https://issues.apache.org/jira/browse/KUDU-1853
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet, tserver
>Affects Versions: 1.2.0
>Reporter: Adar Dembo
>Assignee: Mike Percy
>Priority: Critical
> Fix For: 1.3.0
>
>
> Currently, a failure during tablet copy may leave behind a number of 
> different things:
> # Downloaded superblock (if the failure falls after TabletCopyClient::Start())
> # Downloaded data blocks (if the failure falls during 
> TabletCopyClient::FetchAll())
> # Downloaded WAL segments (if the failure falls during 
> TabletCopyClient::FetchAll())
> # Downloaded cmeta file (if the failure falls during 
> TabletCopyClient::Finish())
> The next time the tserver starts, it'll see that this tablet's state is still 
> TABLET_DATA_COPYING and will tombstone it. That takes care of #1, #3, and #4 
> (well, it leaves the cmeta file behind as the tombstone, but that's 
> intentional).
> Unfortunately, all data blocks are orphaned, because the on-disk superblock 
> has no record of the new blocks, and so they aren't deleted.
> We're already tracking a general purpose GC mechanism for data blocks in 
> KUDU-829, but I think this separate JIRA for describing the problem with 
> tablet copy is useful, if only as a reference for users.
> Separately, it may be worth addressing these issues for failures that don't 
> result in tserver crashes, such as intermittent network outages between 
> tservers. A long lived tserver won't GC for some time, and it'd be nice to 
> reclaim the disk space used by these orphaned objects in the interim, not to 
> mention that implementing this kind of "GC" for data blocks is a lot easier 
> than a general purpose GC.
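A sketch of the narrow block "GC" suggested above, in Python with made-up block 
IDs: any block present on disk but referenced by no tablet superblock is an 
orphan and could be reclaimed.

{code}
def find_orphaned_blocks(blocks_on_disk, blocks_referenced):
    """Blocks present on disk but referenced by no tablet superblock."""
    return set(blocks_on_disk) - set(blocks_referenced)

orphans = find_orphaned_blocks(
    blocks_on_disk={101, 102, 103, 104},
    blocks_referenced={101, 103})
assert orphans == {102, 104}  # safe to delete and reclaim the disk space
{code}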



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-1969) Please tidy up incubator distribution files

2017-04-12 Thread Sebb (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebb updated KUDU-1969:
---
Description: 
Please remove the old incubator releases as per:
http://incubator.apache.org/guides/graduation.html#dist

  was:
Please remove the old incubator releases as per:

http://incubator.apache.org/guides/graduation.html
Transferring Resources
Distribution mirrors

 6.   After you have a release at your new home (/dist/${project}/ area), 
remove any distribution artefacts from your old /dist/incubator/${project}/ 
area. Remember from the mirror guidelines that everything is automatically 
added to archive.apache.org anyway.



> Please tidy up incubator distribution files
> ---
>
> Key: KUDU-1969
> URL: https://issues.apache.org/jira/browse/KUDU-1969
> Project: Kudu
>  Issue Type: Bug
> Environment: http://www.apache.org/dist/incubator/kudu/
>Reporter: Sebb
>
> Please remove the old incubator releases as per:
> http://incubator.apache.org/guides/graduation.html#dist



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (KUDU-1969) Please tidy up incubator distribution files

2017-04-12 Thread Sebb (JIRA)
Sebb created KUDU-1969:
--

 Summary: Please tidy up incubator distribution files
 Key: KUDU-1969
 URL: https://issues.apache.org/jira/browse/KUDU-1969
 Project: Kudu
  Issue Type: Bug
 Environment: http://www.apache.org/dist/incubator/kudu/
Reporter: Sebb


Please remove the old incubator releases as per:

http://incubator.apache.org/guides/graduation.html
Transferring Resources
Distribution mirrors

 6.   After you have a release at your new home (/dist/${project}/ area), 
remove any distribution artefacts from your old /dist/incubator/${project}/ 
area. Remember from the mirror guidelines that everything is automatically 
added to archive.apache.org anyway.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-1968) Aborted tablet copies delete live blocks

2017-04-12 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1968:
-
Code Review: https://gerrit.cloudera.org/#/c/6613/

> Aborted tablet copies delete live blocks
> 
>
> Key: KUDU-1968
> URL: https://issues.apache.org/jira/browse/KUDU-1968
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.3.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Blocker
>
> 72541b47eb55b2df4eab5d6050f517476ed6d370 (KUDU-1853) caused a serious 
> regression in the case of a failed tablet copy. As of that patch, the 
> following sequence happens:
> - we fetch the remote tablet's metadata, and set our local metadata to match 
> it (including the remote block IDs)
> - as we download blocks, we replace remote block ids with local block IDs
> - if we fail in the middle, we call DeleteTablet
> -- this means that, since we still have some remote block IDs in the 
> metadata, the DeleteTablet call deletes local blocks based on remote block 
> IDs. These block ids are likely to belong to other live tablets locally!
> This can cause pretty serious data loss, and has the tendency to cascade 
> around a cluster, since later attempts to copy a tablet with missing blocks 
> will get aborted as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (KUDU-1891) Uploading 100,000 rows x 20 columns results in not enough mutation buffer space when uploading data using Python

2017-04-12 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-1891.
---
   Resolution: Not A Problem
Fix Version/s: n/a

> Uploading 100,000 rows x 20 columns results in not enough mutation buffer 
> space when uploading data using Python
> 
>
> Key: KUDU-1891
> URL: https://issues.apache.org/jira/browse/KUDU-1891
> Project: Kudu
>  Issue Type: Bug
>  Components: python
>Affects Versions: 1.2.0
> Environment: Ubuntu 16.04
>Reporter: Roger
> Fix For: n/a
>
>
> The table had one timestamp column and 19 single precision columns with only 
> the timestamp as the primary key.
> The tuples were uploaded in the following way:
> {code}
> table = client.table('new_table')
> session = client.new_session()
> for t in tuples[:10]:
> session.apply(table.new_insert(t))
> {code}
> Please note that the default flush mode in Python is manual.
> This resulted in the below error:
> {code}
> ---
> KuduBadStatus Traceback (most recent call last)
>  in ()
>   2 session = client.new_session()
>   3 for t in tuples[:10]:
> > 4 session.apply(table.new_insert(t))
>   5 
>   6 try:
> /root/anaconda3/envs/sifr-repository/lib/python3.5/site-packages/kudu/client.pyx
>  in kudu.client.Session.apply (kudu/client.cpp:15185)()
> /root/anaconda3/envs/sifr-repository/lib/python3.5/site-packages/kudu/client.pyx
>  in kudu.client.WriteOperation.add_to_session (kudu/client.cpp:27992)()
> /root/anaconda3/envs/sifr-repository/lib/python3.5/site-packages/kudu/errors.pyx
>  in kudu.errors.check_status (kudu/errors.cpp:1314)()
> KuduBadStatus: b'Incomplete: not enough mutation buffer space remaining for 
> operation: required additional 225 when 7339950 of 7340032 already used'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KUDU-1891) Uploading 100,000 rows x 20 columns results in not enough mutation buffer space when uploading data using Python

2017-04-12 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966496#comment-15966496
 ] 

Todd Lipcon commented on KUDU-1891:
---

Yeah, I was suggesting that the user switch to the other mode, not that we make 
a code change. I'll resolve this as not-a-bug since it's been several weeks 
with no response.
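For reference, a hedged sketch of the alternatives in the Python client 
(assuming kudu-python's 'manual'/'background' flush modes and a placeholder row 
list; "the other mode" above presumably means a non-manual flush mode):

{code}
import kudu

client = kudu.connect('master-host', 7051)
table = client.table('new_table')
tuples = []  # stand-in for the ~100,000 rows from the report

# Option 1: stay in manual flush mode, but flush periodically so the
# mutation buffer never fills up.
session = client.new_session()  # manual flush is the default
for i, t in enumerate(tuples):
    session.apply(table.new_insert(t))
    if i % 1000 == 999:
        session.flush()
session.flush()

# Option 2: let the client flush in the background as the buffer fills.
session = client.new_session(flush_mode='background')
for t in tuples:
    session.apply(table.new_insert(t))
session.flush()  # final flush before exit
{code}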

> Uploading 100,000 rows x 20 columns results in not enough mutation buffer 
> space when uploading data using Python
> 
>
> Key: KUDU-1891
> URL: https://issues.apache.org/jira/browse/KUDU-1891
> Project: Kudu
>  Issue Type: Bug
>  Components: python
>Affects Versions: 1.2.0
> Environment: Ubuntu 16.04
>Reporter: Roger
>
> The table had one timestamp column and 19 single precision columns with only 
> the timestamp as the primary key.
> The tuples were uploaded in the following way:
> {code}
> table = client.table('new_table')
> session = client.new_session()
> for t in tuples[:10]:
> session.apply(table.new_insert(t))
> {code}
> Please note that the default flush mode in Python is manual.
> This resulted in the below error:
> {code}
> ---
> KuduBadStatus Traceback (most recent call last)
>  in ()
>   2 session = client.new_session()
>   3 for t in tuples[:10]:
> > 4 session.apply(table.new_insert(t))
>   5 
>   6 try:
> /root/anaconda3/envs/sifr-repository/lib/python3.5/site-packages/kudu/client.pyx
>  in kudu.client.Session.apply (kudu/client.cpp:15185)()
> /root/anaconda3/envs/sifr-repository/lib/python3.5/site-packages/kudu/client.pyx
>  in kudu.client.WriteOperation.add_to_session (kudu/client.cpp:27992)()
> /root/anaconda3/envs/sifr-repository/lib/python3.5/site-packages/kudu/errors.pyx
>  in kudu.errors.check_status (kudu/errors.cpp:1314)()
> KuduBadStatus: b'Incomplete: not enough mutation buffer space remaining for 
> operation: required additional 225 when 7339950 of 7340032 already used'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KUDU-1968) Aborted tablet copies delete live blocks

2017-04-12 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966491#comment-15966491
 ] 

Todd Lipcon commented on KUDU-1968:
---

I should note that I verified that, with the revert, we're back to the old 
behavior of orphaning blocks. That's expected, and it's preferable to deleting 
the wrong ones.

> Aborted tablet copies delete live blocks
> 
>
> Key: KUDU-1968
> URL: https://issues.apache.org/jira/browse/KUDU-1968
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.3.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Blocker
>
> 72541b47eb55b2df4eab5d6050f517476ed6d370 (KUDU-1853) caused a serious 
> regression in the case of a failed tablet copy. As of that patch, the 
> following sequence happens:
> - we fetch the remote tablet's metadata, and set our local metadata to match 
> it (including the remote block IDs)
> - as we download blocks, we replace remote block ids with local block IDs
> - if we fail in the middle, we call DeleteTablet
> -- this means that, since we still have some remote block IDs in the 
> metadata, the DeleteTablet call deletes local blocks based on remote block 
> IDs. These block ids are likely to belong to other live tablets locally!
> This can cause pretty serious data loss, and has the tendency to cascade 
> around a cluster, since later attempts to copy a tablet with missing blocks 
> will get aborted as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KUDU-1968) Aborted tablet copies delete live blocks

2017-04-12 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966490#comment-15966490
 ] 

Todd Lipcon commented on KUDU-1968:
---

I'm able to repro with the following sequence:

{code}
rm -Rf /tmp/m /tmp/ts-{1,2,3}
ninja -C build/release kudu-tserver kudu-master kudu

build/latest/bin/kudu-master -fs_wal_dir /tmp/m  &
build/latest/bin/kudu-tserver -fs_wal_dir /tmp/ts-1 
-rpc_bind_addresses=0.0.0.0:7001 -webserver_port=8001 -flush_threshold_secs=10 
-unlock-experimental-flags &
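# Fault injection on ts-2 (per the flag name below): crash roughly 20% of the
# time while serving a tablet copy FetchData call, so the copy below fails
# partway through.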
build/latest/bin/kudu-tserver -fs_wal_dir /tmp/ts-2 
-rpc_bind_addresses=0.0.0.0:7002 -webserver_port=8002 -flush_threshold_secs=10 
-unlock-experimental-flags -unlock-unsafe-flags 
-fault-crash-on-handle-tc-fetch-data=0.2 &

sleep 5 # wait for servers to all start

build/latest/bin/kudu test loadgen localhost -keep_auto_table 
-num_rows_per_thread=100

sleep 20 # wait for flush

tablet=$(ls -1 /tmp/ts-2/tablet-meta/* | head -1 | xargs basename)
build/latest/bin/kudu remote_replica copy $tablet localhost:7002 localhost:7001
build/latest/bin/kudu fs check -fs_wal_dir /tmp/ts-1/
{code}

We should revert the patch in trunk and branch-1.3 and release 1.3.1 ASAP.

> Aborted tablet copies delete live blocks
> 
>
> Key: KUDU-1968
> URL: https://issues.apache.org/jira/browse/KUDU-1968
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.3.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Blocker
>
> 72541b47eb55b2df4eab5d6050f517476ed6d370 (KUDU-1853) caused a serious 
> regression in the case of a failed tablet copy. As of that patch, the 
> following sequence happens:
> - we fetch the remote tablet's metadata, and set our local metadata to match 
> it (including the remote block IDs)
> - as we download blocks, we replace remote block ids with local block IDs
> - if we fail in the middle, we call DeleteTablet
> -- this means that, since we still have some remote block IDs in the 
> metadata, the DeleteTablet call deletes local blocks based on remote block 
> IDs. These block ids are likely to belong to other live tablets locally!
> This can cause pretty serious data loss, and has the tendency to cascade 
> around a cluster, since later attempts to copy a tablet with missing blocks 
> will get aborted as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (KUDU-1968) Aborted tablet copies delete live blocks

2017-04-12 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-1968:
-

 Summary: Aborted tablet copies delete live blocks
 Key: KUDU-1968
 URL: https://issues.apache.org/jira/browse/KUDU-1968
 Project: Kudu
  Issue Type: Bug
  Components: tserver
Affects Versions: 1.3.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Blocker


72541b47eb55b2df4eab5d6050f517476ed6d370 (KUDU-1853) caused a serious 
regression in the case of a failed tablet copy. As of that patch, the following 
sequence happens:

- we fetch the remote tablet's metadata, and set our local metadata to match it 
(including the remote block IDs)
- as we download blocks, we replace remote block ids with local block IDs
- if we fail in the middle, we call DeleteTablet
-- this means that, since we still have some remote block IDs in the metadata, 
the DeleteTablet call deletes local blocks based on remote block IDs. These 
block ids are likely to belong to other live tablets locally!

This can cause pretty serious data loss, and it tends to cascade around a 
cluster, since later attempts to copy a tablet with missing blocks will get 
aborted as well.
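The effect can be pictured with a small Python sketch (hypothetical block IDs 
and data structures; the real code path is the C++ tablet copy client):

{code}
# Blocks owned by OTHER tablets that already live on the destination server.
local_live_blocks = {7, 8, 9}

# Step 1: adopt the remote superblock wholesale; it still names remote IDs.
copy_metadata = [7, 8]

# Step 2: as blocks download, remote IDs are swapped for fresh local IDs.
copy_metadata[0] = 101  # only the first block arrived before the failure

# Step 3: the copy aborts and DeleteTablet removes every ID still listed in
# the metadata, including remote ID 8, which is a live local block.
for block_id in copy_metadata:
    local_live_blocks.discard(block_id)

assert local_live_blocks == {7, 9}  # block 8 was deleted from under its owner
{code}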



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (KUDU-463) Add checksumming to cfile and other on-disk formats

2017-04-12 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke reassigned KUDU-463:


Assignee: Grant Henke  (was: Adar Dembo)

> Add checksumming to cfile and other on-disk formats
> ---
>
> Key: KUDU-463
> URL: https://issues.apache.org/jira/browse/KUDU-463
> Project: Kudu
>  Issue Type: Sub-task
>  Components: cfile, tablet
>Affects Versions: Private Beta
>Reporter: Todd Lipcon
>Assignee: Grant Henke
>  Labels: kudu-roadmap
>
> We should add CRC32C checksums to cfile blocks, metadata blocks, etc, to 
> protect against silent disk corruption. We should probably do this prior to a 
> public release, since it will likely have a negative performance impact, and 
> we don't want to have a public regression.
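A minimal sketch of per-block checksumming in Python (this assumes the 
third-party "crc32c" package, pip install crc32c; it is not Kudu's cfile code):

{code}
import struct
import crc32c  # third-party CRC32C (Castagnoli) implementation

def write_block(payload):
    """Append a 4-byte little-endian CRC32C footer to the block payload."""
    return payload + struct.pack('<I', crc32c.crc32c(payload))

def read_block(raw):
    """Verify the footer and return the payload; raise on silent corruption."""
    payload, stored = raw[:-4], struct.unpack('<I', raw[-4:])[0]
    if crc32c.crc32c(payload) != stored:
        raise IOError("CRC32C mismatch: block is corrupted")
    return payload

blob = write_block(b"cfile data block contents")
assert read_block(blob) == b"cfile data block contents"
{code}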



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)