[jira] [Updated] (KUDU-2354) In case of 3-4-3 scheme and 3 tablet servers, catalog manager endlessly retries operations to add a replacement replica even if replacement is no longer needed

2018-03-15 Thread Alexey Serbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2354:

Description: 
In a scenario reported by [~adar], 100 iterations of the following command were 
run:

{noformat}
kudu perf loadgen --keep-auto-table --table-num-buckets=40 
--num-rows-per-thread=1 --table-num-replicas=3
{noformat}

That took about 10-15 minutes to complete, and for some reason ksck reported 
UNAVAILABLE tablets for 5-10 minutes after that.  Most likely, due to a spike 
of IO activity, tablet leaders didn't receive heartbeats from some replicas and 
tried to replace them.  After some time the cluster stabilized (ksck reported 
no problems), but messages like the following kept appearing in the master's 
log:

{noformat}
I0315 13:52:00.871310 106157 catalog_manager.cc:3234] Sending 
ChangeConfig:ADD_PEER:NON_VOTER on tablet 2776eb10c241426e90ddf7354260ee04 
(attempt 22)
I0315 13:52:00.871354 106157 catalog_manager.cc:2700] Scheduling retry of 
ChangeConfig:ADD_PEER:NON_VOTER RPC for tablet 2776eb10c241426e90ddf7354260ee04 
with cas_config_opid_index -1 with a delay of 60018 ms (attempt = 22)
{noformat}

Of course, with only 3 tablet servers in the cluster no attempt to add a 
replacement non-voter replica can succeed, but it would make sense to stop 
retrying such operations once the tablet's committed OpId index is far ahead of 
the cas_config_opid_index of the operation being retried.
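
The check could look roughly like the following; this is only a sketch, and the 
function and parameter names are hypothetical rather than actual catalog 
manager APIs:

{code:cpp}
#include <cstdint>

// Hypothetical guard: give up on retrying a ChangeConfig request once the
// tablet's committed Raft config has advanced far beyond the config index the
// pending ADD_PEER operation was built against.  A cas_config_opid_index of -1
// (no CAS precondition, as in the log above) is handled the same way once the
// committed index has clearly moved on.
bool IsChangeConfigRetryStale(int64_t committed_config_opid_index,
                              int64_t cas_config_opid_index,
                              int64_t staleness_threshold) {
  return committed_config_opid_index - cas_config_opid_index >
         staleness_threshold;
}
{code}

The catalog manager could evaluate such a predicate before scheduling the next 
retry and drop the operation instead of re-sending it forever.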

  was:
In a scenario reported by [~adar], 100 iterations of the following command were 
run:

{noformat}
kudu perf loadgen --keep-auto-table --tablet-num-buckets 40 
--num-rows-per-thread=1 --tablet-num-replicas=3
{noformat}

That took about 10-15 minutes to complete, and for some reason ksck reported 
UNAVAILABLE tablets for 5-10 minutes after that.  Most likely, due to a spike 
of IO activity, tablet leaders didn't receive heartbeats from some replicas and 
tried to replace them.  After some time the cluster stabilized (ksck reported 
no problems), but messages like the following kept appearing in the master's 
log:

{noformat}
I0315 13:52:00.871310 106157 catalog_manager.cc:3234] Sending 
ChangeConfig:ADD_PEER:NON_VOTER on tablet 2776eb10c241426e90ddf7354260ee04 
(attempt 22)
I0315 13:52:00.871354 106157 catalog_manager.cc:2700] Scheduling retry of 
ChangeConfig:ADD_PEER:NON_VOTER RPC for tablet 2776eb10c241426e90ddf7354260ee04 
with cas_config_opid_index -1 with a delay of 60018 ms (attempt = 22)
{noformat}

Of course, with only 3 tablet servers in the cluster no attempt to add a 
replacement non-voter replica can succeed, but it would make sense to stop 
retrying such operations once the tablet's committed OpId index is far ahead of 
the cas_config_opid_index of the operation being retried.


> In case of 3-4-3 scheme and 3 tablet servers, catalog manager endlessly 
> retries operations to add a replacement replica even if replacement is no 
> longer needed
> ---
>
> Key: KUDU-2354
> URL: https://issues.apache.org/jira/browse/KUDU-2354
> Project: Kudu
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.0
> Environment: 3 tservers in the cluster, single master (?)
>Reporter: Alexey Serbin
>Priority: Major
>
> In a scenario reported by [~adar], 100 iterations of the following command 
> were run:
> {noformat}
> kudu perf loadgen --keep-auto-table --table-num-buckets=40 
> --num-rows-per-thread=1 --table-num-replicas=3
> {noformat}
> That took about 10-15 minutes to complete, and for some reason ksck reported 
> UNAVAILABLE tablets for 5-10 minutes after that.  Most likely, due to a spike 
> of IO activity, tablet leaders didn't receive heartbeats from some replicas 
> and tried to replace them.  After some time the cluster stabilized (ksck 
> reported no problems), but messages like the following kept appearing in the 
> master's log:
> {noformat}
> I0315 13:52:00.871310 106157 catalog_manager.cc:3234] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 2776eb10c241426e90ddf7354260ee04 
> (attempt 22)
> I0315 13:52:00.871354 106157 catalog_manager.cc:2700] Scheduling retry of 
> ChangeConfig:ADD_PEER:NON_VOTER RPC for tablet 
> 2776eb10c241426e90ddf7354260ee04 with cas_config_opid_index -1 with a delay 
> of 60018 ms (attempt = 22)
> {noformat}
> Of course, with only 3 tablet servers in the cluster no attempt to add a 
> replacement non-voter replica can succeed, but it would make sense to stop 
> retrying such operations once the tablet's committed OpId index is far ahead 
> of the cas_config_opid_index of the operation being retried.





[jira] [Created] (KUDU-2354) In case of 3-4-3 scheme and 3 tablet servers, catalog manager endlessly retries operations to add a replacement replica even if replacement is no longer needed

2018-03-15 Thread Alexey Serbin (JIRA)
Alexey Serbin created KUDU-2354:
---

 Summary: In case of 3-4-3 scheme and 3 tablet servers, catalog 
manager endlessly retries operations to add a replacement replica even if 
replacement is no longer needed
 Key: KUDU-2354
 URL: https://issues.apache.org/jira/browse/KUDU-2354
 Project: Kudu
  Issue Type: Bug
  Components: master
Affects Versions: 1.7.0
 Environment: 3 tservers in the cluster, single master (?)
Reporter: Alexey Serbin


In a scenario reported by [~adar], 100 iterations of the following command were 
run:

{noformat}
kudu perf loadgen --keep-auto-table --tablet-num-buckets 40 
--num-rows-per-thread=1 --tablet-num-replicas=3
{noformat}

That took about 10-15 minutes to complete, and for some reason ksck reported 
UNAVAILABLE tablets for 5-10 minutes after that.  Most likely, due to a spike 
of IO activity, tablet leaders didn't receive heartbeats from some replicas and 
tried to replace them.  After some time the cluster stabilized (ksck reported 
no problems), but messages like the following kept appearing in the master's 
log:

{noformat}
I0315 13:52:00.871310 106157 catalog_manager.cc:3234] Sending 
ChangeConfig:ADD_PEER:NON_VOTER on tablet 2776eb10c241426e90ddf7354260ee04 
(attempt 22)
I0315 13:52:00.871354 106157 catalog_manager.cc:2700] Scheduling retry of 
ChangeConfig:ADD_PEER:NON_VOTER RPC for tablet 2776eb10c241426e90ddf7354260ee04 
with cas_config_opid_index -1 with a delay of 60018 ms (attempt = 22)
{noformat}

Of course, with only 3 tablet servers in the cluster no attempt to add a 
replacement non-voter replica can succeed, but it would make sense to stop 
retrying such operations once the tablet's committed OpId index is far ahead of 
the cas_config_opid_index of the operation being retried.





[jira] [Updated] (KUDU-428) Support for service/table/column authorization

2018-03-15 Thread Tony Foerster (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Foerster updated KUDU-428:
---
Target Version/s:   (was: Backlog)

> Support for service/table/column authorization
> --
>
> Key: KUDU-428
> URL: https://issues.apache.org/jira/browse/KUDU-428
> Project: Kudu
>  Issue Type: New Feature
>  Components: master, security, tserver
>Affects Versions: 1.2.0
>Reporter: Todd Lipcon
>Priority: Critical
>  Labels: kudu-roadmap
>
> We need to support basic SQL-like access control:
> - grant/revoke on tables, columns
> - service-level grant/revoke
> - probably need some group/role mapping infrastructure as well





[jira] [Updated] (KUDU-2350) Kudu C++ client application might fail with SIGPIPE if TLS connection aborted from the tablet server side

2018-03-15 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2350:
--
Component/s: client

> Kudu C++ client application might fail with SIGPIPE if TLS connection aborted 
> from the tablet server side
> -
>
> Key: KUDU-2350
> URL: https://issues.apache.org/jira/browse/KUDU-2350
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Reporter: Alexey Serbin
>Priority: Major
>  Labels: newbie
>
> [~tlipcon] noticed that {{kudu perf loadgen}} fails with SIGPIPE if the 
> TLS-protected connection is terminated abruptly on the server side.
> Most likely, we are missing the MSG_NOSIGNAL socket option for TLS sockets.  
> Setting MSG_NOSIGNAL on client sockets (if possible) or calling 
> {{pthread_sigmask()}} to ignore SIGPIPE could help.
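
A minimal sketch of the {{pthread_sigmask()}} option, for illustration only 
(this is not the actual Kudu client code):

{code:cpp}
#include <pthread.h>
#include <signal.h>

// Block SIGPIPE for the calling thread so that writing to a connection the
// server side has already torn down surfaces as an EPIPE error from the write
// call instead of terminating the whole client process.
void BlockSigPipeForCurrentThread() {
  sigset_t blocked;
  sigemptyset(&blocked);
  sigaddset(&blocked, SIGPIPE);
  pthread_sigmask(SIG_BLOCK, &blocked, nullptr);
}
{code}

The MSG_NOSIGNAL route would instead pass that flag to the send() calls on the 
client sockets, where the networking code allows it.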





[jira] [Updated] (KUDU-2350) Kudu C++ client application might fail with SIGPIPE if TLS connection aborted from the tablet server side

2018-03-15 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2350:
--
Labels: newbie  (was: )

> Kudu C++ client application might fail with SIGPIPE if TLS connection aborted 
> from the tablet server side
> -
>
> Key: KUDU-2350
> URL: https://issues.apache.org/jira/browse/KUDU-2350
> Project: Kudu
>  Issue Type: Bug
>Reporter: Alexey Serbin
>Priority: Major
>  Labels: newbie
>
> [~tlipcon] noticed that {{kudu perf loadgen}} fails with SIGPIPE if the 
> TLS-protected connection is terminated abruptly on the server side.
> Most likely, we are missing the MSG_NOSIGNAL socket option for TLS sockets.  
> Setting MSG_NOSIGNAL on client sockets (if possible) or calling 
> {{pthread_sigmask()}} to ignore SIGPIPE could help.





[jira] [Commented] (KUDU-16) Add server-side LIMIT for scanners

2018-03-15 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400834#comment-16400834
 ] 

Todd Lipcon commented on KUDU-16:
-

It appears I started this about 5 years ago but never got far. Here's my WIP 
branch I found lying around:
https://github.com/toddlipcon/kudu/tree/scanner_limit

> Add server-side LIMIT for scanners
> --
>
> Key: KUDU-16
> URL: https://issues.apache.org/jira/browse/KUDU-16
> Project: Kudu
>  Issue Type: New Feature
>  Components: client, perf, tablet, tserver
>Affects Versions: M3
>Reporter: Todd Lipcon
>Assignee: Smyatkin Maxim
>Priority: Major
>






[jira] [Created] (KUDU-2353) Add tooling to parse diagnostics log

2018-03-15 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2353:
-

 Summary: Add tooling to parse diagnostics log
 Key: KUDU-2353
 URL: https://issues.apache.org/jira/browse/KUDU-2353
 Project: Kudu
  Issue Type: Improvement
  Components: ops-tooling
Reporter: Todd Lipcon
Assignee: Todd Lipcon


KUDU-2297 added a diagnostics log which includes periodic metrics dumps as well 
as stack samples. We have a somewhat-crufty 'parse_metrics_log.py' script which 
no longer works with the new format, and was never particularly good anyway. We 
should add more tools baked into the 'kudu' CLI tool to parse the log and 
extract interesting information such as human-readable stack trace snapshots, 
metrics in TSV form, etc.





[jira] [Resolved] (KUDU-2297) Expand metrics logging into a general purpose diagnostics log

2018-03-15 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2297.
---
  Resolution: Fixed
   Fix Version/s: 1.7.0
Target Version/s:   (was: 1.8.0)

Actually since the majority of the work on the log itself is done for 1.7, I'm 
gonna mark this one as closed and add a new JIRA for 1.8 to add some tooling 
around the log.

> Expand metrics logging into a general purpose diagnostics log
> -
>
> Key: KUDU-2297
> URL: https://issues.apache.org/jira/browse/KUDU-2297
> Project: Kudu
>  Issue Type: Improvement
>  Components: ops-tooling, supportability
>Affects Versions: 1.6.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Major
> Fix For: 1.7.0
>
>
> Currently Kudu servers have the ability to periodically dump metrics to the 
> log. KUDU-2279 improved the compactness and performance of the log so it can 
> be enabled by default.
> In addition to metrics, though, it would be useful to have some other 
> machine-readable data periodically recorded by servers. For example, periodic 
> stack traces or RPC traces could be helpful for understanding latency issues 
> after-the-fact. This JIRA tracks the effort to convert the metrics log to a 
> more general-purpose diagnostics log.





[jira] [Updated] (KUDU-2231) "materializing_iterator_do_pushdown=true" cause simple query slow

2018-03-15 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2231:
--
Fix Version/s: 1.6.1
   1.5.1

> "materializing_iterator_do_pushdown=true" cause simple query slow
> -
>
> Key: KUDU-2231
> URL: https://issues.apache.org/jira/browse/KUDU-2231
> Project: Kudu
>  Issue Type: Bug
>  Components: master, tserver
>Affects Versions: 1.4.0, 1.5.0, 1.6.0
> Environment: CentOS release 6.5 (2.6.32-431.11.9.el6.ucloud.x86_64)
> KUDU-1.4.0-1.cdh5.12.1.p0.10
> IMPALA 2.6.0
> x86-64 
> Intel CPU
>Reporter: DawnZhang
>Assignee: Dan Burkert
>Priority: Major
> Fix For: 1.5.1, 1.7.0, 1.6.1
>
> Attachments: 756ACA6F105F0905EBCB79B940FFCE86.jpg, 
> F8C604537B8E921DDCCA78995DC11BDA.jpg, screenshot-1.png
>
>
> I ran the following SQL again and again while refreshing the 8050/scans page 
> at the same time.
> SQL:
> {code:sql}
> select count(xx_id), count(yy_id), count(time) from test_table where event_id = 29983;
> {code}
> "Cells read from disk" is much greater than the table size when 
> materializing_iterator_do_pushdown = true (the default).
> After setting materializing_iterator_do_pushdown = false, "Cells read from 
> disk" dropped to a reasonable value (close to the table size) and the SQL ran 
> faster.
> Here are the details.
> Table under test:
> {code:sql}
> CREATE TABLE rawdata.test_table (
>   day INT NOT NULL ENCODING BIT_SHUFFLE COMPRESSION DEFAULT_COMPRESSION,
>   user_id BIGINT NOT NULL ENCODING BIT_SHUFFLE COMPRESSION DEFAULT_COMPRESSION,
>   time TIMESTAMP NOT NULL ENCODING BIT_SHUFFLE COMPRESSION DEFAULT_COMPRESSION,
>   event_id INT NULL ENCODING BIT_SHUFFLE COMPRESSION DEFAULT_COMPRESSION,
>   distinct_id STRING NULL ENCODING DICT_ENCODING COMPRESSION DEFAULT_COMPRESSION,
>   ...
>   ...  other fields ...
>   ...
>   PRIMARY KEY (day, user_id, time, _offset)
> )
> PARTITION BY HASH (user_id) PARTITIONS 9
> STORED AS KUDU
> TBLPROPERTIES ( ... );
> {code}
> Table size (select count(1) from test_table): 19510709
> CASE 1, materializing_iterator_do_pushdown = true
> [^756ACA6F105F0905EBCB79B940FFCE86.jpg]
> CASE 2, materializing_iterator_do_pushdown = false (sql ran faster)
> [^F8C604537B8E921DDCCA78995DC11BDA.jpg]
> It looks like Kudu scans the table multiple times for this simple SQL, 
> caused by some silly bug.





[jira] [Resolved] (KUDU-2309) /masters can show the wrong list of masters

2018-03-15 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke resolved KUDU-2309.
---
  Resolution: Fixed
   Fix Version/s: 1.7.0
Target Version/s: 1.7.0

Resolved via 
[13fc066|https://github.com/apache/kudu/commit/13fc0666d6eecff021ca92348644d674c994cee1].

> /masters can show the wrong list of masters
> ---
>
> Key: KUDU-2309
> URL: https://issues.apache.org/jira/browse/KUDU-2309
> Project: Kudu
>  Issue Type: Bug
>  Components: ops-tooling
>Affects Versions: 1.6.0
>Reporter: Will Berkeley
>Assignee: Will Berkeley
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: twoleaders2309.png
>
>
> Consider the following steps:
>  # Three masters are started with UUIDs A, B, and C.
>  # A is shut down and its data deleted. A new master with UUID D is started 
> on the same machine.
> After this, visiting /masters on B or C should show A, B, and C as the 
> registered masters, since they are the masters in B and C's quorum. D's 
> /masters should just show D. However, right now we show B, C, D in all three 
> /masters pages.





[jira] [Updated] (KUDU-2342) Non-voter replicas can be promoted and get stuck

2018-03-15 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2342:
--
Fix Version/s: 1.7.0

> Non-voter replicas can be promoted and get stuck
> 
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Blocker
>  Labels: scalability
> Fix For: 1.7.0, 1.8.0
>
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on a 129-node cluster via Impala, a write operation 
> failed with:
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)





[jira] [Resolved] (KUDU-2259) kudu-spark imports authentication token into client multiple times

2018-03-15 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke resolved KUDU-2259.
---
   Resolution: Fixed
Fix Version/s: 1.8.0
   1.7.0

Resolved via 
[e684de3|https://github.com/apache/kudu/commit/e684de3371941cc5ae8fc4a546ecda7dbe9f4f2f].

> kudu-spark imports authentication token into client multiple times
> --
>
> Key: KUDU-2259
> URL: https://issues.apache.org/jira/browse/KUDU-2259
> Project: Kudu
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 1.6.0
>Reporter: Will Berkeley
>Assignee: Dan Burkert
>Priority: Blocker
> Fix For: 1.7.0, 1.8.0
>
>
> kudu-spark should have one KuduContext per task, which is sent serialized 
> from the driver with an authentication token. The KuduContext either 
> retrieves a Kudu client from a JVM-scoped cache, or creates one and puts it 
> in the cache, and finally imports its authentication token into the client.
> Under default configuration in an un-Kerberized cluster, the client uses the 
> authentication token to connect to the cluster. However, if 
> -rpc_encryption=disabled, then the client will not use the authentication 
> token. This causes the master to issue an authentication token to the client, 
> and the new token replaces the old token in the client.
> While there's one KuduContext per task, multiple tasks may run on the same 
> executor. If this occurs, each KuduContext tries to import its authentication 
> token into the client. If the client has already received a token from the 
> master because encryption is disabled, then it's possible that the 
> KuduContext's token and the master-issued token are for different users, 
> since the KuduContext's token was issued on the driver to the driver's Unix 
> user and the master-issued token is issued to the executor's user.
> An example of the exception that occurred when running spark2-shell as root:
> {noformat}
> 18/01/11 12:14:01 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 
> (TID 1, kudu-tserver-01, executor 1): java.lang.IllegalArgumentException: 
> cannot import authentication data from a different user: old='yarn', 
> new='root'
>   at 
> org.apache.kudu.client.SecurityContext.checkUserMatches(SecurityContext.java:128)
>   at 
> org.apache.kudu.client.SecurityContext.importAuthenticationCredentials(SecurityContext.java:138)
>   at 
> org.apache.kudu.client.AsyncKuduClient.importAuthenticationCredentials(AsyncKuduClient.java:677)
>   at 
> org.apache.kudu.spark.kudu.KuduContext.asyncClient$lzycompute(KuduContext.scala:103)
>   at 
> org.apache.kudu.spark.kudu.KuduContext.asyncClient(KuduContext.scala:100)
>   at 
> org.apache.kudu.spark.kudu.KuduContext.syncClient$lzycompute(KuduContext.scala:98)
>   at 
> org.apache.kudu.spark.kudu.KuduContext.syncClient(KuduContext.scala:98)
>   at org.apache.kudu.spark.kudu.KuduRDD.compute(KuduRDD.scala:71)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}





[jira] [Updated] (KUDU-2343) Java client doesn't properly reconnect to leader master when old leader is online

2018-03-15 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2343:
--
Target Version/s: 1.5.1, 1.7.0, 1.6.1, 1.8.0  (was: 1.3.2, 1.4.1, 1.5.1, 
1.7.0, 1.6.1)
   Fix Version/s: 1.5.1

> Java client doesn't properly reconnect to leader master when old leader is 
> online
> -
>
> Key: KUDU-2343
> URL: https://issues.apache.org/jira/browse/KUDU-2343
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 1.5.1, 1.7.0, 1.6.1, 1.8.0
>
>
> In the following sequence of events, the Java client doesn't properly fail 
> over to locate a new master, and in fact gets "stuck" until the client is 
> restarted:
> - client connects to the cluster and caches the master locations
> - client opens a table and caches tablet locations
> - the master fails over to a new leader
> - the tablet either goes down or fails over, causing the client to need to 
> update its tablet locations
> In this case, it gets stuck in a retry loop where it will never be able to 
> connect to the new leader master.





[jira] [Resolved] (KUDU-2343) Java client doesn't properly reconnect to leader master when old leader is online

2018-03-15 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke resolved KUDU-2343.
---
Resolution: Fixed

> Java client doesn't properly reconnect to leader master when old leader is 
> online
> -
>
> Key: KUDU-2343
> URL: https://issues.apache.org/jira/browse/KUDU-2343
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 1.5.1, 1.7.0, 1.6.1, 1.8.0
>
>
> In the following sequence of events, the Java client doesn't properly fail 
> over to locate a new master, and in fact gets "stuck" until the client is 
> restarted:
> - client connects to the cluster and caches the master locations
> - client opens a table and caches tablet locations
> - the master fails over to a new leader
> - the tablet either goes down or fails over, causing the client to need to 
> update its tablet locations
> In this case, it gets stuck in a retry loop where it will never be able to 
> connect to the new leader master.





[jira] [Updated] (KUDU-2342) Non-voter replicas can be promoted and get stuck

2018-03-15 Thread Alexey Serbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2342:

  Resolution: Fixed
   Fix Version/s: 1.8.0
Target Version/s: 1.7.0, 1.8.0  (was: 1.7.0)
  Status: Resolved  (was: In Review)

Fixed with:
  a74f9a0dcaf88315c8563b95cdeb5701d9ce5438
  4c1788eae5bfd4cf4a714f1ca0ab775b005303b3
  f2479e21d5a3002ebf5b1012fde83a6cffc2db82

> Non-voter replicas can be promoted and get stuck
> 
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Blocker
>  Labels: scalability
> Fix For: 1.8.0
>
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on a 129-node cluster via Impala, a write operation 
> failed with:
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)


