[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.

2016-09-06 Thread Juan Yu (Code Review)
Hello Sailesh Mukil,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4080

to look at the new patch set (#4).

Change subject: IMPALA-3912: test_random_rpc_timeout is flaky.
..

IMPALA-3912: test_random_rpc_timeout is flaky.

The datastream sender's default timeout is 2 minutes, which
can block some fragments from completing until the timeout
expires and keep the metric "num-fragments-in-flight" from
returning to 0 within 60 seconds.
Decrease the sender timeout to 30 seconds and add some
logging.

Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983
---
M be/src/service/fragment-mgr.cc
M tests/custom_cluster/test_rpc_timeout.py
2 files changed, 9 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/4080/4
-- 
To view, visit http://gerrit.cloudera.org:8080/4080
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.

2016-09-06 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3912: test_random_rpc_timeout is flaky.
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4080/3//COMMIT_MSG
Commit Message:

PS3, Line 11: datastream sender timeout is 2 mins
> is the problem that some fragment gets blocked waiting for a receiver that 
Good idea, that also avoids increasing the test execution time.


-- 
To view, visit http://gerrit.cloudera.org:8080/4080
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.

2016-09-01 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3912: test_random_rpc_timeout is flaky.
..


Patch Set 3: Code-Review+1

Carry previous +1

-- 
To view, visit http://gerrit.cloudera.org:8080/4080
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.

2016-09-01 Thread Juan Yu (Code Review)
Hello Sailesh Mukil,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4080

to look at the new patch set (#3).

Change subject: IMPALA-3912: test_random_rpc_timeout is flaky.
..

IMPALA-3912: test_random_rpc_timeout is flaky.

A query might take more than one minute to clean up, which can
keep the metric "num-fragments-in-flight" from returning to 0. For
example, the datastream sender timeout is 2 minutes. So increase the
metrics-checking timeout to 3 minutes and add some logging.

Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983
---
M be/src/service/fragment-mgr.cc
M tests/custom_cluster/test_rpc_timeout.py
2 files changed, 5 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/4080/3
-- 
To view, visit http://gerrit.cloudera.org:8080/4080
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.

2016-09-01 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3912: test_random_rpc_timeout is flaky.
..


Patch Set 2:

I was not able to reproduce the failure with this test locally or on a 
RHEL machine, but I tried a few other tests: I added the same metrics check to 
test_cancellation.py and test_failpoints.py, hit the sender timeout, and saw the 
metrics check fail with the 60s timeout.

-- 
To view, visit http://gerrit.cloudera.org:8080/4080
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.

2016-08-31 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3912: test_random_rpc_timeout is flaky.
..


Patch Set 2: Code-Review+1

Carry Sailesh's +1

-- 
To view, visit http://gerrit.cloudera.org:8080/4080
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.

2016-08-31 Thread Juan Yu (Code Review)
Hello Sailesh Mukil,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4080

to look at the new patch set (#2).

Change subject: IMPALA-3912: test_random_rpc_timeout is flaky.
..

IMPALA-3912: test_random_rpc_timeout is flaky.

Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983
---
M be/src/service/fragment-mgr.cc
M tests/custom_cluster/test_rpc_timeout.py
2 files changed, 5 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/4080/2
-- 
To view, visit http://gerrit.cloudera.org:8080/4080
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.

2016-08-31 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3912: test_random_rpc_timeout is flaky.
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4080/1/be/src/service/fragment-mgr.cc
File be/src/service/fragment-mgr.cc:

Line 103:   VLOG_QUERY << "PlanFragment completed. instance_id="
> nit: I think this fits into a single line. Please ignore if it doesn't
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4080
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.

2016-08-22 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3912: test_random_rpc_timeout is flaky.
..


Patch Set 1:

I couldn't find why the metric "num-fragments-in-flight" does not return to 0. One 
suspect is that a query might take more than one minute to clean up; for example, 
the datastream sender timeout is 2 mins.
So increase the metrics-checking timeout and add some logging.
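
The metric check described above amounts to polling until "num-fragments-in-flight" 
returns to 0 within a deadline. A minimal Python sketch of such a poll loop (the 
`wait_for_metric` helper and its signature are hypothetical, not Impala's actual 
test API):

```python
import time

def wait_for_metric(get_value, expected, timeout_s=60.0, interval_s=0.5):
    """Poll get_value() until it equals expected or timeout_s elapses.

    Returns True on success, False if the deadline passes first. A longer
    timeout_s tolerates slow query cleanup (e.g. a 2-minute sender timeout).
    """
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if get_value() == expected:
            return True
        time.sleep(interval_s)
    return get_value() == expected
```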

-- 
To view, visit http://gerrit.cloudera.org:8080/4080
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.

2016-08-22 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/4080

Change subject: IMPALA-3912: test_random_rpc_timeout is flaky.
..

IMPALA-3912: test_random_rpc_timeout is flaky.

Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983
---
M be/src/service/fragment-mgr.cc
M tests/custom_cluster/test_rpc_timeout.py
2 files changed, 6 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/4080/1
-- 
To view, visit http://gerrit.cloudera.org:8080/4080
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Juan Yu <j...@cloudera.com>


[Impala-ASF-CR] IMPALA-3470: DecompressorTest is flaky.

2016-08-15 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3470: DecompressorTest is flaky.
..


Patch Set 3: Code-Review+2

Rebase, Carry Dan's +2

-- 
To view, visit http://gerrit.cloudera.org:8080/3954
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7ebaa403abf45e31f38d6cf4e557d6274d877a8a
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3470: DecompressorTest is flaky.

2016-08-12 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3470: DecompressorTest is flaky.
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3954/1/be/src/util/decompress-test.cc
File be/src/util/decompress-test.cc:

Line 237: int truncated = rand() % 512 + 1;
> Can we add an ASSERT_LE(truncated, compressed_len)?
Done
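
The fix above can be sketched in Python: `rand() % 512 + 1` guarantees a strictly 
positive truncation amount, and the suggested ASSERT_LE keeps it within the 
compressed length (the `truncate_stream` helper below is hypothetical, for 
illustration only):

```python
import random

def truncate_stream(compressed: bytes, max_cut: int = 512) -> bytes:
    """Remove a random, strictly positive number of trailing bytes,
    guaranteeing the stream is actually truncated."""
    cut = random.randrange(max_cut) + 1  # 1..max_cut, never 0
    assert cut <= len(compressed)        # mirrors the suggested ASSERT_LE
    return compressed[:-cut]
```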


-- 
To view, visit http://gerrit.cloudera.org:8080/3954
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7ebaa403abf45e31f38d6cf4e557d6274d877a8a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3470: DecompressorTest is flaky.

2016-08-12 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#2).

Change subject: IMPALA-3470: DecompressorTest is flaky.
..

IMPALA-3470: DecompressorTest is flaky.

Make sure the random number is greater than 0 and the stream is indeed truncated.

Change-Id: I7ebaa403abf45e31f38d6cf4e557d6274d877a8a
---
M be/src/util/decompress-test.cc
1 file changed, 2 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/3954/2
-- 
To view, visit http://gerrit.cloudera.org:8080/3954
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7ebaa403abf45e31f38d6cf4e557d6274d877a8a
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>


[Impala-ASF-CR] IMPALA-3470: DecompressorTest is flaky.

2016-08-12 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/3954

Change subject: IMPALA-3470: DecompressorTest is flaky.
..

IMPALA-3470: DecompressorTest is flaky.

Make sure the random number is greater than 0 and the stream is indeed truncated.

Change-Id: I7ebaa403abf45e31f38d6cf4e557d6274d877a8a
---
M be/src/util/decompress-test.cc
1 file changed, 1 insertion(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/3954/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3954
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I7ebaa403abf45e31f38d6cf4e557d6274d877a8a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Juan Yu <j...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-07-08 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 21:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/3343/21/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS21, Line 227: is
> delete "is"
Done


Line 304:   TNetworkAddress address_;
> can't we get this from client_->address()?
"client_" is not an instance of ThriftClientImpl


http://gerrit.cloudera.org:8080/#/c/3343/21/be/src/runtime/exec-env.cc
File be/src/runtime/exec-env.cc:

PS21, Line 134: 30
> Is there a short comment you could write to justify how this was chosen (5 
Done


PS21, Line 134: The time after "
  : "which a backend client send/recv RPC call will timeout.
> The send/recv connection timeout in milliseconds for a backend client RPC.
This is the underlying TSocket send/recv call timeout, not connection timeout.


PS21, Line 138:  
> same
Done


PS21, Line 157: 0
> why is this 0? (wait_ms)
This is for retrying opening a connection; each retry usually takes several 
seconds, so waiting even longer won't help much.


PS21, Line 162: 100
> how was this chosen?
I'll set this to 0.


Line 223: "", !FLAGS_ssl_client_ca_certificate.empty())),
> not your change, but it's really unfortunate we duplicate this code. let's 
I'll add a TODO here.


http://gerrit.cloudera.org:8080/#/c/3343/21/be/src/testutil/fault-injection-util.h
File be/src/testutil/fault-injection-util.h:

PS21, Line 36: RPC_RANDOM
> comment that this must be last
Done


PS21, Line 39: call
> delete
Done


PS21, Line 40: timeout
> this is the recv connection timeout, correct? if so, how about saying "recv
Done


PS21, Line 41: RpcCallType my_type, int32_t rpc_type, int32_t delay_ms
> document these.
Done


PS21, Line 44: rpc_type == RPC_NULL
> what is specifying RPC_NULL used for?
Just for easy testing: you can enable or disable the fault injection by 
changing the value, with no need to add/remove the startup flag. In the future, we 
could change this value dynamically to do more testing.
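
The RPC_NULL convention described above can be sketched as follows (a 
hypothetical Python analog of the C++ fault-injection macro; the names are 
illustrative, not the real flags):

```python
import time

RPC_NULL = 0  # sentinel: injection disabled without removing the flag

def maybe_inject_delay(fault_rpc_type: int, this_rpc_type: int, delay_ms: int):
    """Sleep only when fault injection targets this RPC type.

    Setting fault_rpc_type to RPC_NULL turns injection off; changing it
    (even dynamically) re-enables it without restarting the cluster.
    """
    if fault_rpc_type == RPC_NULL:
        return
    if fault_rpc_type == this_rpc_type:
        time.sleep(delay_ms / 1000.0)
```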


Line 50:   FLAGS_fault_injection_rpc_type, 
FLAGS_fault_injection_rpc_delay_ms)
> why pass these as arguments rather than just having InjectRpcDelay() read t
Similar reason as above: we could test with dynamic values without needing to 
restart the cluster.


http://gerrit.cloudera.org:8080/#/c/3343/21/tests/custom_cluster/test_rpc_timeout.py
File tests/custom_cluster/test_rpc_timeout.py:

Line 119: self.execute_query_verify_metrics(self.TEST_QUERY, 10)
> how long do all these tests take to execute?  let's run them only in exhaus
About 5 minutes. OK, I'll change it to execute only in exhaustive mode.


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 21
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-07-07 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 20:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS20, Line 224: indicate
> Right, but in the case of success, it's not really a "retry". That's just s
Done


Line 265:   } catch (apache::thrift::TException& e) {
> I'm fine with leaving this alone for now but it seems a little weird.
Let me add a TODO here.


http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/data-stream-sender.cc
File be/src/runtime/data-stream-sender.cc:

PS20, Line 231: timeout_ms
> It'd be preferred to remove the timeout_ms param and this extra logic until
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 20
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-07-06 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 20:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/service/fragment-mgr.cc
File be/src/service/fragment-mgr.cc:

Line 38:   FAULT_INJECTION_RPC_DELAY(RPC_EXECPLANFRAGMENT);
> is there a reason we decided to put these here rather than in impala-intern
No, I forgot we have a proxy for those RPC calls. You are right; it's better to 
put them there.


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 20
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-07-06 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 20:

(23 comments)

http://gerrit.cloudera.org:8080/#/c/3343/20//COMMIT_MSG
Commit Message:

Line 12: Impala doesn't set socket send/recv timeout for backend client.
> Prior to this change, ...
Done


PS20, Line 24: new RPC call RetryRpcRecv()
> This makes it sound like RetryRpcRecv() is a new RPC.  I haven't looked at 
Done


PS20, Line 28: call
> delete (C of RPC stands for call)
Done


http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/common/global-flags.cc
File be/src/common/global-flags.cc:

PS20, Line 102: call
> delete
Done


PS20, Line 102: which trigger caller hits RPC timeout.
> to trigger an RPC timeout on the caller side.
Done


http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/client-cache.cc
File be/src/runtime/client-cache.cc:

Line 162:   client_map_.erase(client);
> is it legal to use the 'client' iterator even though the lock was dropped i
See http://kera.name/articles/2011/06/iterator-invalidation-rules/
For insertion, iterators are not affected. For erasure, only iterators to the 
erased elements are affected.
The client is not shared, so there shouldn't be more than one thread trying to 
destroy the same client, right?


Line 163:   *client_key = NULL;
> why doesn't this method need to remove from the per_host_caches_ as well? T
A connection cannot be shared: while a connection is in use, it is taken out 
of the per-host caches, so we just don't put it back.
The header comment means client_map_. I'll clarify that.


http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS20, Line 96: it
> what is "it"? the client? or the connection?  does this NULL out *client_ke
"it" means the client.  DestroyClient NULL out *client_key like Release Client()


PS20, Line 96: cache
> is this "the per-host cache" that is referenced everywhere else, or somethi
Done


PS20, Line 224: indicate
> indicates
If the returned status is OK, it means both RPCs completed. Whether or not we can 
retry really depends on the data being sent here:
if this is TransmitData, a retry is not safe,
but if this is ReportStatusCb, a retry is fine.
Only the caller knows that.
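
The retry-safety point above can be sketched as a wrapper that retries only 
idempotent RPCs (a hypothetical Python illustration, not Impala's actual 
client code):

```python
def do_rpc_with_retry(send, recv, idempotent: bool, max_attempts: int = 3):
    """Send an RPC; retry only when a resend cannot duplicate side effects.

    A failure after send() may mean the server already acted on the request,
    so non-idempotent RPCs (like TransmitData) get one attempt, while
    idempotent ones (like ReportStatusCb) may be resent.
    """
    attempts = max_attempts if idempotent else 1
    last_err = None
    for _ in range(attempts):
        try:
            send()
            return recv()
        except IOError as e:
            last_err = e
    raise last_err
```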


Line 225:   /// the whole RPC call.
> explain this more.  is it because if the send RPC call failed, then the han
that's right.


Line 260:   if (!status.ok()) {
> so this is the only path where we have *retry_is_safe == true and an error 
Done


Line 265:   } catch (apache::thrift::TException& e) {
> why does this path not need the timeout check also?
I think the original logic is to give it a second chance, but if it fails again, 
just fail the whole RPC.


Line 271: client_is_unrecoverable_ = false;
> why is this logic needed now, whereas it wasn't needed before? or is this a
Yes, it's another bug fix. I need to fix it in this patch; otherwise the broken 
connection will cause lots of problems. See IMPALA-2864.


PS20, Line 301: the
> whether the
See my previous patch and Henry's comment:
"last_rpc_succeeded_" matches the code logic better;
"client_is_unrecoverable_" more clearly describes the state.
Not sure which one is better.


http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/data-stream-sender.cc
File be/src/runtime/data-stream-sender.cc:

Line 115:   ImpalaBackendClientCache* client_cache_;
> now that we store the RuntimeState, let's get rid of this and just get the 
Done


PS20, Line 153: wait
> retry waiting for the response?
Done


PS20, Line 153: call
> delete "call", and would be nice to write RPC
Done


PS20, Line 231: timeout_ms
> where is this called with a non-zero timeout_ms?
For now, we'd like to wait indefinitely to avoid a logic change.
But in the future, even after replacing the RPC implementation, it's better to 
have a reasonable timeout for all RPCs.


Line 239:   timeout_ms));
> this will throw away the original status message.  Maybe use AddDetail() in
Done


http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/service/fragment-exec-state.cc
File be/src/service/fragment-exec-state.cc:

Line 125: if (!retry_is_safe) break;
> so i guess this was a bug in the old code?
Yes. But for the ReportExecStatus RPC, a resend shouldn't cause any issue; the 
handler just processes the same status multiple times.


http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/statestore/statestore.cc
File be/src/statestore/statestore.cc:

PS20, Line 658: Rewrite
> Add details to
Done


PS20, Line 669: Rewrite
> same
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 20
Gerrit-Project: Impala

[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-07-02 Thread Juan Yu (Code Review)
Hello Sailesh Mukil,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3343

to look at the new patch set (#19).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

This patch adds a configurable timeout for all backend client
RPC calls to avoid query hang issue.

Impala doesn't set a socket send/recv timeout for the backend client, so
RPC calls will wait forever for data. In the extreme case of a bad network,
or a destination host with a kernel panic, the sender will not get a response
and the RPC call will hang. A query hang is hard to detect. If the hang happens
at ExecRemoteFragment() or CancelPlanFragments(), the query cannot be
cancelled unless you restart the coordinator.

Added a send/recv timeout to all RPC calls to avoid query hangs.
For the catalog client, keep the default timeout of 0 (no timeout) because ExecDdl()
could take a very long time if a table has many partitions, mainly waiting for
the HMS API call.

Added a new helper RetryRpcRecv() to wait longer for the receiver's response.
This is needed by certain RPCs. For example, for TransmitData() from
DataStreamSender, the receiver could hold the response to apply back pressure.

If an RPC call fails, we close the underlying connection instead of putting it
back into the cache. This makes sure the bad state of that connection
won't cause more RPC failures.

Added a retry for the CancelPlanFragment RPC. This reduces the chance that a cancel
request gets lost due to an unstable network, but it can make cancellation
take longer and makes test_lifecycle.py more flaky.
The metric num-fragments-in-flight might not be 0 yet due to previous tests.
Modified the test to check the metric delta instead of comparing to 0 to
reduce flakiness. However, this might not capture some failures.

Besides the new EE test, I used the following iptables rules to
inject network failures and make sure RPC calls never hang.
1. Block network traffic on a port completely
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packet to slowdown network
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random 
--probability 0.05 -j DROP
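
The effect of the send/recv timeout can be illustrated with a plain socket:
once a timeout is set, a dead peer produces an error instead of an indefinite
hang (a Python sketch for illustration; Impala's actual implementation sets the
timeout on Thrift's TSocket):

```python
import socket

def recv_with_timeout(sock: socket.socket, nbytes: int, timeout_s: float) -> bytes:
    """Bound a recv() so a silent peer raises instead of hanging forever."""
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        raise TimeoutError("RPC recv timed out after %.1fs" % timeout_s)
```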

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/common/global-flags.cc
M be/src/rpc/thrift-util.cc
M be/src/rpc/thrift-util.h
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/exec-env.cc
M be/src/service/fragment-exec-state.cc
M be/src/service/fragment-mgr.cc
M be/src/service/impala-server.cc
M be/src/statestore/statestore.cc
A be/src/testutil/fault-injection-util.h
M be/src/util/error-util-test.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
M tests/query_test/test_lifecycle.py
M tests/verifiers/metric_verifier.py
18 files changed, 378 insertions(+), 53 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/19
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 19
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-07-02 Thread Juan Yu (Code Review)
Hello Sailesh Mukil,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3343

to look at the new patch set (#18).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

This patch adds a configurable timeout for all backend client
RPC calls to avoid query hang issue.

Impala doesn't set a socket send/recv timeout for the backend client, so
RPC calls will wait forever for data. In the extreme case of a bad network,
or a destination host with a kernel panic, the sender will not get a response
and the RPC call will hang. A query hang is hard to detect. If the hang happens
at ExecRemoteFragment() or CancelPlanFragments(), the query cannot be
cancelled unless you restart the coordinator.

Added a send/recv timeout to all RPC calls to avoid query hangs.
For the catalog client, keep the default timeout of 0 (no timeout) because ExecDdl()
could take a very long time if a table has many partitions, mainly waiting for
the HMS API call.

Added a new helper RetryRpcRecv() to wait longer for the receiver's response.
This is needed by certain RPCs. For example, for TransmitData() from
DataStreamSender, the receiver could hold the response to apply back pressure.

If an RPC call fails, we close the underlying connection instead of putting it
back into the cache. This makes sure the bad state of that connection
won't cause more RPC failures.

Added a retry for the CancelPlanFragment RPC. This reduces the chance that a cancel
request gets lost due to an unstable network, but it can make cancellation
take longer and makes test_lifecycle.py more flaky.
The metric num-fragments-in-flight might not be 0 yet due to previous tests.
Modified the test to check the metric delta instead of comparing to 0 to
reduce flakiness. However, this might not capture some failures.

Besides the new EE test, I used the following iptables rules to
inject network failures and make sure RPC calls never hang.
1. Block network traffic on a port completely
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packet to slowdown network
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random 
--probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/common/global-flags.cc
M be/src/rpc/thrift-util.cc
M be/src/rpc/thrift-util.h
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/exec-env.cc
M be/src/service/fragment-exec-state.cc
M be/src/service/fragment-mgr.cc
M be/src/service/impala-server.cc
M be/src/statestore/statestore.cc
A be/src/testutil/fault-injection-util.h
M be/src/util/error-util-test.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
M tests/query_test/test_lifecycle.py
M tests/verifiers/metric_verifier.py
18 files changed, 378 insertions(+), 53 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/18
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 18
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-07-02 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 17:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS17, Line 203: if (client_ != NULL) {
  :   if (!has_rpc_error_) {
  : client_cache_->ReleaseClient(_);
  :   } else {
  : client_cache_->DestroyClient(_);
  :   }
  : }
> is this path covered by your tests - i.e. the connection will be reopened i
Yes, test_random_rpc_timeout covers this. It runs multiple queries, each of
which hits an RPC timeout, so the connection is destroyed. The following
query then creates new connections.
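The destroy-on-error behavior discussed here can be sketched in Python (hypothetical names and structure for illustration; the real implementation is the C++ ClientCache in be/src/runtime/client-cache.h):

```python
class ClientCache:
    """Minimal sketch of a connection cache that drops bad connections.

    Healthy connections are returned to the pool for reuse; a connection
    whose last RPC failed may be in a bad state (e.g. unread response
    bytes), so it is destroyed instead of cached.
    """

    def __init__(self):
        self._pool = []    # idle, reusable connections
        self.created = 0   # how many connections were ever opened

    def get_client(self):
        if self._pool:
            return self._pool.pop()
        self.created += 1
        return {"id": self.created, "ok": True}

    def release_client(self, conn):
        # Only healthy connections go back to the pool.
        self._pool.append(conn)

    def destroy_client(self, conn):
        # Failed RPC: close the connection rather than cache it, so its
        # bad state cannot cause further RPC failures.
        conn["ok"] = False
```

A query whose RPC times out destroys its connection; the next query then opens a fresh one instead of inheriting the bad state.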


Line 301:   /// Indicate the rpc call sent by this connection succeeds or not. 
If the rpc call fails
> last rpc call?
Done


PS17, Line 304: has_rpc_error_
> I would call this something more descriptive, like "cnxn_not_recoverable_".
How about just "last_rpc_succeed_"? It matches the logic, so there is less confusion.


http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/runtime/data-stream-sender.cc
File be/src/runtime/data-stream-sender.cc:

PS17, Line 233: (timeout_ms == 0) ? 0 : 
> can remove this, and just check timeout_ms in the while condition.
Done


PS17, Line 236: deadline
> timeout_ms
Done


http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/service/fragment-exec-state.cc
File be/src/service/fragment-exec-state.cc:

PS17, Line 116: can_retry
> call this "retry_is_safe" ?
Done


http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/service/fragment-mgr.cc
File be/src/service/fragment-mgr.cc:

PS17, Line 37: #ifndef NDEBUG
 :   DECLARE_int32(fault_injection_rpc_delay_ms);
 :   DECLARE_int32(fault_injection_rpc_type);
 :   #define FAULT_INJECTION_RPC_DALAY(type) InjectRpcDelay(type, \
 :   FLAGS_fault_injection_rpc_type, 
FLAGS_fault_injection_rpc_delay_ms)
 : #else
 :   #define FAULT_INJECTION_RPC_DALAY(type)
 : #endif
> would it be better to put this in fault-injection-util.h?
Figured out the compilation issue.
Yes, it's better to put this in fault-injection-util.h and avoid duplicated code.


http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

PS17, Line 186: DALAY
> DELAY
Done


http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/testutil/fault-injection-util.h
File be/src/testutil/fault-injection-util.h:

PS17, Line 15: _
> nit: remove trailing _
Done


http://gerrit.cloudera.org:8080/#/c/3343/17/tests/custom_cluster/test_rpc_timeout.py
File tests/custom_cluster/test_rpc_timeout.py:

PS17, Line 34: i
> nit: use _ to show that we don't care about the variable
Done


PS17, Line 61: test_runtime_filter_rpc
> won't pytest try to run this as a test, since it's called test_*?
It does. I'll change the name.


PS17, Line 72: 3000
> is this definitely large enough to time out? I don't see if you verify that
Yes, since backend_client_rpc_timeout_ms is set to 1000.
I verified in the log that each RPC call did hit the timeout. However, hitting
the timeout doesn't necessarily lead to complete query failure. It triggers
fragment cancellation, but Cancel() just sets the runtime state's cancel flag;
it doesn't force query execution to stop. By the time an exec node checks the
cancel flag, execution could already be complete and the query finishes
successfully. Also, some RPC failures don't affect query execution, like the
runtime-filter one.
The only RPC that can guarantee query failure is ExecPlanFragment(), so I'll
check for query failure there.
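The point that Cancel() only sets a flag can be illustrated with a small cooperative-cancellation sketch (hypothetical names, not actual Impala code):

```python
class FragmentState:
    """Sketch of cooperative cancellation: cancel() only sets a flag;
    the executor checks it between work units, so execution that never
    reaches another check simply completes normally."""

    def __init__(self, rows):
        self.cancelled = False
        self.rows = rows

    def cancel(self):
        self.cancelled = True  # does not forcibly stop execution

    def exec_fragment(self):
        produced = 0
        for _ in range(self.rows):
            if self.cancelled:            # flag checked per work unit
                return ("CANCELLED", produced)
            produced += 1
        return ("OK", produced)           # finished before seeing the flag
```

This is why an RPC timeout that triggers cancellation does not guarantee the query fails: the fragment may finish all its work before the next flag check.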


PS17, Line 72: -
> just want to check this works as intended with a missing '-'
Yes, both "-" and "--" work, but I'll make them consistent.


PS17, Line 73: rpc_timeout1
> nit: can you call these "test_execplanfragment_timeout" etc?
Done


http://gerrit.cloudera.org:8080/#/c/3343/17/tests/query_test/test_lifecycle.py
File tests/query_test/test_lifecycle.py:

PS17, Line 36: \
> not needed
Done


PS17, Line 52: \
> remove
Done


http://gerrit.cloudera.org:8080/#/c/3343/17/tests/verifiers/metric_verifier.py
File tests/verifiers/metric_verifier.py:

PS17, Line 68: wait_for_metric_reset
> Is this class really necessary, or could we use MetricVerifier and always p
We usually need to check metrics for all impalads. I'll modify MetricVerifier
to keep the initial values.
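Keeping initial values and checking a delta, rather than requiring an absolute zero, could look like this (a hypothetical sketch, not the actual MetricVerifier API):

```python
class MetricDeltaVerifier:
    """Snapshots a metric's initial value so later checks compare the
    delta, tolerating leftover in-flight fragments from earlier tests."""

    def __init__(self, get_metric):
        self._get = get_metric
        self._initial = get_metric()  # snapshot before the test runs

    def delta(self):
        return self._get() - self._initial

    def reached_delta(self, expected=0):
        return self.delta() == expected
```

A metric that starts at 5 because of previous tests, rises during the test, and returns to 5 afterwards still yields a delta of 0, so the test passes without demanding the cluster-wide value be exactly 0.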


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 17
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Ch

[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-30 Thread Juan Yu (Code Review)
Hello Sailesh Mukil,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3343

to look at the new patch set (#17).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

This patch adds a configurable timeout for all backend client
RPC calls to avoid query hangs.

Impala doesn't set socket send/recv timeouts for backend clients, so
RPC calls will wait forever for data. In extreme cases of a bad network,
or when the destination host has a kernel panic, the sender will not get
a response and the RPC call will hang. A query hang is hard to detect.
If the hang happens in ExecRemoteFragment() or CancelPlanFragments(),
the query cannot be cancelled unless you restart the coordinator.

Added a send/recv timeout to all RPC calls to avoid query hangs.
For the catalog client, keep the default timeout at 0 (no timeout) because
ExecDdl() could take a very long time if a table has many partitions,
mainly waiting on HMS API calls.

Added a new RPC call RetryRpcRecv() to wait longer for the receiver's
response. This is needed by certain RPCs. For example, for TransmitData()
sent by DataStreamSender, the receiver could hold the response to apply
back pressure.

If an RPC call fails, we don't put the underlying connection back into
the cache but close it. This makes sure the connection's bad state
won't cause further RPC failures.

Added a retry for the CancelPlanFragment RPC. This reduces the chance that
a cancel request gets lost on an unstable network, but it can make
cancellation take longer and makes test_lifecycle.py more flaky.
The metric num-fragments-in-flight might not be 0 yet due to previous tests.
Modified the test to check the metric delta instead of comparing to 0 to
reduce flakiness. However, this might not capture some failures.

Besides the new EE test, I used the following iptables rules to
inject network failures and make sure RPC calls never hang.
1. Block network traffic on a port completely
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/common/global-flags.cc
M be/src/rpc/thrift-util.cc
M be/src/rpc/thrift-util.h
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/exec-env.cc
M be/src/service/fragment-exec-state.cc
M be/src/service/fragment-mgr.cc
M be/src/service/impala-server.cc
M be/src/statestore/statestore.cc
A be/src/testutil/fault-injection-util.h
M be/src/util/error-util-test.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
M tests/query_test/test_lifecycle.py
M tests/verifiers/metric_verifier.py
18 files changed, 392 insertions(+), 53 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/17
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 17
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-30 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 16:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/3343/16/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

Line 226:   /// Returns RPC_RECV_TIMEOUT if a timeout occurred, 
RPC_CLIENT_CONNECT_FAILURE if the
> "while waiting for a response"
Done


Line 254:   << e.what() << ", type=" << typeid(e).name();
> nit: align << with above
Done


PS16, Line 274: or certain RPCs, server could take long time to process it and 
intentionally block
  :   /// the RPC response to add back pressure. In those cases, 
client needs to retry recv
  :   /// RPC call to wait longer.
> suggest: "In certain cases, the server may take longer to provide an RPC re
Done


PS16, Line 299: /// Indicate the rpc call sent by this connection succeeds or 
not. If the rpc call fails
  :   /// for any reason, the connection could be left in a bad 
state and cannot be reused any
  :   /// more.
  :   bool has_rpc_error_;
> Why not proactively destroy the connection when this condition is hit?
As discussed offline, I don't want to destroy the connection right away
because of the retry logic in DoRpc().


http://gerrit.cloudera.org:8080/#/c/3343/16/be/src/runtime/data-stream-sender.cc
File be/src/runtime/data-stream-sender.cc:

PS16, Line 228: DoRpcTimedWait
> Call this something less generic? Even DoTransmitDataRpc() would work.
Done


PS16, Line 232: (timeout_ms == 0) ? 0 : 
> if you just MonotonicMillis() + timeout_ms, and check for MonotonicMillis()
timeout == 0 means wait indefinitely.
I added a comment for that.


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 16
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-29 Thread Juan Yu (Code Review)
Hello Sailesh Mukil,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3343

to look at the new patch set (#16).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

This patch adds a configurable timeout for all backend client
RPC calls to avoid query hangs.

Impala doesn't set socket send/recv timeouts for backend clients, so
RPC calls will wait forever for data. In extreme cases of a bad network,
or when the destination host has a kernel panic, the sender will not get
a response and the RPC call will hang. A query hang is hard to detect.
If the hang happens in ExecRemoteFragment() or CancelPlanFragments(),
the query cannot be cancelled unless you restart the coordinator.

Added a send/recv timeout to all RPC calls to avoid query hangs.
For the catalog client, keep the default timeout at 0 (no timeout) because
ExecDdl() could take a very long time if a table has many partitions,
mainly waiting on HMS API calls.

Added a new RPC call RetryRpcRecv() to wait longer for the receiver's
response. This is needed by certain RPCs. For example, for TransmitData()
sent by DataStreamSender, the receiver could hold the response to apply
back pressure.

If an RPC call fails, we don't put the underlying connection back into
the cache but close it. This makes sure the connection's bad state
won't cause further RPC failures.

Added a retry for the CancelPlanFragment RPC. This reduces the chance that
a cancel request gets lost on an unstable network, but it can make
cancellation take longer and makes test_lifecycle.py more flaky.
The metric num-fragments-in-flight might not be 0 yet due to previous tests.
Modified the test to check the metric delta instead of comparing to 0 to
reduce flakiness. However, this might not capture some failures.

Besides the new EE test, I used the following iptables rules to
inject network failures and make sure RPC calls never hang.
1. Block network traffic on a port completely
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/rpc/thrift-util.cc
M be/src/rpc/thrift-util.h
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/exec-env.cc
M be/src/service/fragment-exec-state.cc
M be/src/statestore/statestore.cc
M be/src/util/error-util-test.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
M tests/query_test/test_lifecycle.py
M tests/verifiers/metric_verifier.py
14 files changed, 237 insertions(+), 53 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/16
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 16
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-29 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 15:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/3343/15/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

Line 97:   void DestroyClient(ClientKey* client_key);
> needs a comment
Done


PS15, Line 197: hasRpcError_
> C++-style naming
Done


PS15, Line 239: IsTimeoutTException
> should now be called IsRecvTimeoutTException()
Done


PS15, Line 240: RPC_TIMEOUT
> should be RPC_RECV_TIMEOUT
Done


PS15, Line 246: if (strstr(e.what(),"unknown result") != NULL)
> this seems very brittle, and at least should be wrapped in an IsXXXExceptio
Done


PS15, Line 268: release
> back pressure is added by a blocking RPC, not released.
Done


PS15, Line 273: timeout
> what's the unit?
Done


PS15, Line 272: Status DoRpcTimedWait(const F& f, const Request& request, 
Response* response,
  :   const G& wait_func, uint64_t timeout, RuntimeState* 
state, bool* can_retry = NULL) 
> This seems like it breaks abstractions: not all RPCs happen in the context 
I agree with you. I did it this way in patch set 6, but I had to duplicate the
wait-for-response loop for several RPC calls, so I changed it later.


PS15, Line 282: bool no_timeout = timeout == 0;
> this can be simplified by having a deadline variable:
Done
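The deadline-variable simplification suggested above can be sketched in Python (hypothetical names; the real code is C++ in client-cache.h, and timeout_ms == 0 still means wait forever, per the convention discussed in this review):

```python
import time

def retry_rpc_recv(recv_once, timeout_ms):
    """Retry a recv that raises TimeoutError until an overall deadline.

    timeout_ms == 0 means no deadline (wait indefinitely). Computing the
    deadline once up front avoids branching on timeout_ms inside the
    loop body, which was the reviewer's suggestion.
    """
    deadline = None if timeout_ms == 0 else time.monotonic() + timeout_ms / 1000.0
    while True:
        try:
            return recv_once()
        except TimeoutError:
            if deadline is not None and time.monotonic() >= deadline:
                raise  # overall deadline exceeded: give up
            # per-call recv timed out but budget remains; retry the recv
```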


Line 290:   if (!IsTimeoutTException(e)) {
> Just curious, which function in recv_TransmitData() throws timeout exceptio
The exception is thrown by Thrift; see read() in TSocket.cpp.
recv_TransmitData() calls the Thrift API to read data from the server side.


PS15, Line 309: bool hasRpcError_;
> the role of this variable is not clear. Please add a comment.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 15
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-28 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 15:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3343/15/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

PS15, Line 1490: // Try to send the RPC 3 times before failing.
> Why try 3 times? Have you seen in your testing where there's failure on the
This is to increase the chance that the cancel request reaches the remote
nodes, to avoid orphan fragments. If the network is unstable, we could get a
"send expire" error on the coordinator-to-remote-node connection, but the
report-status callback might keep working, so the remote nodes aren't aware
there is a connection issue with the coordinator.
Though DoRpc() always retries once, during a connection storm it might not be
able to create a new connection on the first retry.
If you are unlucky, you could get a closed connection from the cache (this can
happen if CreateClient() in ClientCacheHelper::ReopenClient() fails for a
previous RPC call); then the cancel request might not get a chance to be sent out.


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 15
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-24 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 15:

Exhaustive tests and stress tests are green.
http://sandbox.jenkins.cloudera.com/view/Impala/view/Private-Utility/job/impala-private-build-and-test/3511/
http://sandbox.jenkins.cloudera.com/view/Impala/view/Stress/job/Impala-Stress-Test-Physical/615/

-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 15
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-24 Thread Juan Yu (Code Review)
Hello Sailesh Mukil,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3343

to look at the new patch set (#15).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

This patch adds a configurable timeout for all backend client
RPC calls to avoid query hangs.

Impala doesn't set socket send/recv timeouts for backend clients, so
RPC calls will wait forever for data. In extreme cases of a bad network,
or when the destination host has a kernel panic, the sender will not get
a response and the RPC call will hang. A query hang is hard to detect.
If the hang happens in ExecRemoteFragment() or CancelPlanFragments(),
the query cannot be cancelled unless you restart the coordinator.

Added a send/recv timeout to all RPC calls to avoid query hangs.
For the catalog client, keep the default timeout at 0 (no timeout) because
ExecDdl() could take a very long time if a table has many partitions,
mainly waiting on HMS API calls.

Added a new RPC call DoRpcTimedWait() to wait longer for the receiver's
response. This is needed by certain RPCs. For example, for TransmitData()
sent by DataStreamSender, the receiver could hold the response due to
back pressure.

If an RPC call fails, we don't put the underlying connection back into
the cache but close it. This makes sure the connection's bad state
won't cause further RPC failures.

Added a retry for the CancelPlanFragment RPC. This reduces the chance that
a cancel request gets lost on an unstable network, but it can make
cancellation take longer and makes test_lifecycle.py more flaky.
The metric num-fragments-in-flight might not be 0 yet due to previous tests.
Modified the test to check the metric delta instead of comparing to 0 to
reduce flakiness. However, this might not capture some failures.

Besides the new EE test, I used the following iptables rules to
inject network failures and make sure RPC calls never hang.
1. Block network traffic on a port completely
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/exec-env.cc
M be/src/service/fragment-exec-state.cc
M be/src/statestore/statestore.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
M tests/query_test/test_lifecycle.py
M tests/verifiers/metric_verifier.py
11 files changed, 205 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/15
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 15
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-22 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 14: Code-Review+1

Carry Sailesh's +1

-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 14
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-22 Thread Juan Yu (Code Review)
Hello Sailesh Mukil,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3343

to look at the new patch set (#14).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

This patch adds a configurable timeout for all backend client
RPC calls to avoid query hangs.

Impala doesn't set socket send/recv timeouts for backend clients, so
RPC calls will wait forever for data. In extreme cases of a bad network,
or when the destination host has a kernel panic, the sender will not get
a response and the RPC call will hang. A query hang is hard to detect.
If the hang happens in ExecRemoteFragment() or CancelPlanFragments(),
the query cannot be cancelled unless you restart the coordinator.

Added a send/recv timeout to all RPC calls to avoid query hangs, and fixed
a bug where the reporting thread does not quit even after the query is
cancelled. For the catalog client, keep the default timeout at 0 (no timeout)
because ExecDdl() could take a very long time if a table has many partitions,
mainly waiting on HMS API calls.

Added a new RPC call DoRpcTimedWait() to wait longer for the receiver's
response. This is needed by certain RPCs. For example, for TransmitData()
sent by DataStreamSender, the receiver could hold the response due to
back pressure.

If an RPC call fails, we don't put the underlying connection back into
the cache but close it. This makes sure the connection's bad state
won't cause further RPC failures.

Added a retry for the CancelPlanFragment RPC. This reduces the chance that
a cancel request gets lost on an unstable network, but it can make
cancellation take longer and makes test_lifecycle.py more flaky.
The metric num-fragments-in-flight might not be 0 yet due to previous tests.
Modified the test to check the metric delta instead of comparing to 0 to
reduce flakiness. However, this might not capture some failures.

Besides the new EE test, I used the following iptables rules to
inject network failures and make sure RPC calls never hang.
1. Block network traffic on a port completely
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/plan-fragment-executor.cc
M be/src/service/fragment-exec-state.cc
M be/src/statestore/statestore.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
M tests/query_test/test_lifecycle.py
M tests/verifiers/metric_verifier.py
12 files changed, 222 insertions(+), 32 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/14
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 14
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-22 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 13:

(1 comment)

I added a section to the commit message. Is it enough to explain the test change?
"Added a retry for the CancelPlanFragment RPC. This reduces the chance that
a cancel request gets lost on an unstable network, but it can make
cancellation take longer and makes test_lifecycle.py more flaky.
The metric num-fragments-in-flight might not be 0 yet due to previous tests.
Modified the test to check the metric delta instead of comparing to 0 to
reduce flakiness. However, this might not capture some failures.
"

http://gerrit.cloudera.org:8080/#/c/3343/13/be/src/runtime/plan-fragment-executor.cc
File be/src/runtime/plan-fragment-executor.cc:

PS13, Line 560: (!done_ && (closed_ || is_cancelled_))
> It would be useful to add a comment here saying that done_ should not be tr
Will do.


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 13
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-22 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#13).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

This patch adds a configurable timeout for all backend client
RPC calls to avoid query hang issues.

Impala doesn't set a socket send/recv timeout for the backend client, so
RPC calls will wait forever for data. In extreme cases of a bad network,
or when the destination host has a kernel panic, the sender will not get
a response and the RPC call will hang. Query hangs are hard to detect. If
the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the
query cannot be cancelled unless you restart the coordinator.

Added a send/recv timeout to all RPC calls to avoid query hangs, and fixed
a bug where the reporting thread does not quit even after the query is
cancelled. For the catalog client, keep the default timeout of 0 (no
timeout) because ExecDdl() could take a very long time if a table has many
partitions, mostly waiting for the HMS API call.

Added a new RPC call DoRpcTimedWait() to wait longer for the receiver's
response. This is needed by certain RPCs; for example, for TransmitData()
sent by DataStreamSender, the receiver could hold the response due to
back pressure.

If an RPC call fails, we don't put the underlying connection back in the
cache but close it instead. This is to make sure the bad state of this
connection won't cause more RPC failures.

Added a retry for the CancelPlanFragment RPC. This reduces the chance that
a cancel request gets lost due to an unstable network, but it can make
cancellation take longer and makes test_lifecycle.py more flaky.
The metric num-fragments-in-flight might not be back to 0 yet due to
previous tests, so the test now checks the metric delta instead of
comparing to 0 to reduce flakiness. However, this might not capture some
failures.

Besides the new EE test, I used the following iptables rules to inject
network failures and make sure RPC calls never hang.
1. Block network traffic on a port completely:
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network:
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP
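The DoRpcTimedWait() idea above (a short socket timeout on the RPC itself, then a much longer wait for the receiver's response) can be sketched as follows. This is only an illustration of the pattern; the function names and string status codes are made up here, not Impala's actual C++ API:

```python
import time

def do_rpc_timed_wait(send_rpc, recv_response, short_timeout_s, long_timeout_s):
    """Sketch of the DoRpcTimedWait() idea: issue the RPC with a short
    socket timeout; if only the receive side timed out, keep polling for
    the response up to a much longer deadline, because the receiver may
    simply be applying back pressure rather than being dead."""
    status = send_rpc(timeout_s=short_timeout_s)
    if status != "RPC_TIMEOUT":
        return status  # success or a hard failure: nothing more to wait for
    deadline = time.monotonic() + long_timeout_s
    while time.monotonic() < deadline:
        status = recv_response(timeout_s=short_timeout_s)
        if status != "RPC_TIMEOUT":
            return status  # the receiver finally answered
    return "RPC_TIMEOUT"  # gave up after the long deadline
```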

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/plan-fragment-executor.cc
M be/src/service/fragment-exec-state.cc
M be/src/statestore/statestore.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
M tests/query_test/test_lifecycle.py
M tests/verifiers/metric_verifier.py
12 files changed, 219 insertions(+), 32 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/13
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 13
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-22 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 12:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/3343/12/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

PS12, Line 1494: NULL
> Can't we pass runtime_state() here?
We can, but I don't think we need to check it. Even if the query is cancelled, 
we still want to send the cancel request to the remote node.
And this one has a very short timeout, so the RPC won't hang for a long time 
like the one in TransmitData().


http://gerrit.cloudera.org:8080/#/c/3343/12/be/src/service/fragment-exec-state.cc
File be/src/service/fragment-exec-state.cc:

PS12, Line 120: NULL
> Same here. Can't we pass executor_.runtime_state() here?
See L#93: the runtime state may not have been set.
And, same as the CancelRemoteFragment request, this one only has a short 
timeout, so it won't hang for long.


PS12, Line 122: Status(res.status)
> Will this be valid when '!can_retry' is true?
Good catch, I'll fix it.


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 12
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-22 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#12).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

This patch adds a configurable timeout for all backend client
RPC calls to avoid query hang issues.

Impala doesn't set a socket send/recv timeout for the backend client, so
RPC calls will wait forever for data. In extreme cases of a bad network,
or when the destination host has a kernel panic, the sender will not get
a response and the RPC call will hang. Query hangs are hard to detect. If
the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the
query cannot be cancelled unless you restart the coordinator.

Added a send/recv timeout to all RPC calls to avoid query hangs, and fixed
a bug where the reporting thread does not quit even after the query is
cancelled. For the catalog client, keep the default timeout of 0 (no
timeout) because ExecDdl() could take a very long time if a table has many
partitions, mostly waiting for the HMS API call.

Added a new RPC call DoRpcTimedWait() to wait longer for the receiver's
response. This is needed by certain RPCs; for example, for TransmitData()
sent by DataStreamSender, the receiver could hold the response due to
back pressure.

If an RPC call fails, we don't put the underlying connection back in the
cache but close it instead. This is to make sure the bad state of this
connection won't cause more RPC failures.

Besides the new EE test, I used the following iptables rules to inject
network failures and make sure RPC calls never hang.
1. Block network traffic on a port completely:
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network:
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/plan-fragment-executor.cc
M be/src/service/fragment-exec-state.cc
M be/src/statestore/statestore.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
M tests/query_test/test_lifecycle.py
M tests/verifiers/metric_verifier.py
12 files changed, 219 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/12
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 12
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-22 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 11:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/3343/11/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS11, Line 272: bool* can_retry = NULL
> Would it be easy to just pass a 'int num_retry' here, so that the retry loo
I am afraid that way this function does too much.
I prefer to leave the retry to the caller, so the caller can decide whether to 
use the same connection or a new one, or to add some delay between retries.
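The caller-driven retry described here could look roughly like this. A hedged sketch only: the connect/rpc callables and status strings are made up for illustration, not the client-cache API:

```python
import time

def call_with_retry(connect, rpc, max_attempts=3, delay_s=0.0):
    """Sketch of caller-side retry: the caller decides how many attempts
    to make, whether each attempt gets a fresh connection, and how long
    to pause between attempts, instead of baking that policy into the
    RPC helper itself."""
    status = "RPC_ERROR"
    for _ in range(max_attempts):
        conn = connect()  # caller's choice: reuse a connection or open a new one
        status = rpc(conn)
        if status == "OK":
            return status
        if delay_s:
            time.sleep(delay_s)  # optional delay between retries
    return status
```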


PS11, Line 281: //status = WaitRpcResp(wait_func, response, state, timeout);
> Is this logic complete? Or will you be posting another patch?
Sorry, I forgot to remove the old code.
And I do need to post a new patch; I just found that RuntimeState might not 
always be available.


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 11
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-21 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#11).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

This patch adds a configurable timeout for all backend client
RPC calls to avoid query hang issues.

Impala doesn't set a socket send/recv timeout for the backend client, so
RPC calls will wait forever for data. In extreme cases of a bad network,
or when the destination host has a kernel panic, the sender will not get
a response and the RPC call will hang. Query hangs are hard to detect. If
the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the
query cannot be cancelled unless you restart the coordinator.

Added a send/recv timeout to all RPC calls to avoid query hangs, and fixed
a bug where the reporting thread does not quit even after the query is
cancelled. For the catalog client, keep the default timeout of 0 (no
timeout) because ExecDdl() could take a very long time if a table has many
partitions, mostly waiting for the HMS API call.

Added a new RPC call WaitRpcResp() to wait longer for the receiver's
response. This is needed by certain RPCs; for example, for TransmitData()
sent by DataStreamSender, the receiver could hold the response due to
back pressure.

If an RPC call fails, we don't put the underlying connection back in the
cache but close it instead. This is to make sure the bad state of this
connection won't cause more RPC failures.

Besides the new EE test, I used the following iptables rules to inject
network failures and make sure RPC calls never hang.
1. Block network traffic on a port completely:
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network:
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/plan-fragment-executor.cc
M be/src/service/fragment-exec-state.cc
M be/src/statestore/statestore.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
M tests/query_test/test_lifecycle.py
M tests/verifiers/metric_verifier.py
12 files changed, 219 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/11
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 11
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-21 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 10:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/3343/10/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

PS10, Line 1495: rpc_status = backend_client.WaitRpcResp(
    ::recv_CancelPlanFragment, , runtime_state(), 100);
> By this stage, the send() part of the RPC has already succeeded right? So, 
Yes, you're right. I like the idea of adding a retry flag.


http://gerrit.cloudera.org:8080/#/c/3343/10/be/src/runtime/data-stream-sender.cc
File be/src/runtime/data-stream-sender.cc:

PS10, Line 230: Status DataStreamSender::Channel::DoRpcTimedWait(ImpalaBackendConnection* client,
    const TTransmitDataParams& params, TTransmitDataResult* res, uint64_t timeout) {
  Status status = client->DoRpc(::TransmitData, params, res);
  if (status.code() == TErrorCode::RPC_TIMEOUT) {
    status = client->WaitRpcResp(::recv_TransmitData, res, runtime_state_, ONE_HOUR_IN_MS);
  }
  return status;
}
> This would be more cleaner if done as a function in client-cache rather tha
Done


http://gerrit.cloudera.org:8080/#/c/3343/10/be/src/runtime/plan-fragment-executor.cc
File be/src/runtime/plan-fragment-executor.cc:

PS10, Line 560: if (IsCompleted()) {
> What if 'done_' is true here?
When done_ is set, PlanFragmentExecutor::FragmentComplete() is always called, 
the status is reported back to the coordinator, and the report thread is 
stopped. Checking done_ here could accidentally stop the report thread too 
soon and cause the fragment-completed status not to be sent to the coordinator.


http://gerrit.cloudera.org:8080/#/c/3343/10/be/src/runtime/plan-fragment-executor.h
File be/src/runtime/plan-fragment-executor.h:

PS10, Line 146: IsCompleted()
> This technically should not be IsCompleted() because we check if (!done_). 
Since it's only used once, I'll just check the variables directly instead of 
using a function.


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 10
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-20 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#10).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

This patch adds a configurable timeout for all backend client
RPC calls to avoid query hang issue.

Impala doesn't set a socket send/recv timeout for the backend client, so
RPC calls will wait forever for data. In extreme cases of a bad network,
or when the destination host has a kernel panic, the sender will not get
a response and the RPC call will hang. Query hangs are hard to detect. If
the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the
query cannot be cancelled unless you restart the coordinator.

Added a send/recv timeout to all RPC calls to avoid query hangs, and fixed
a bug where the reporting thread does not quit even after the query is
cancelled. For the catalog client, keep the default timeout of 0 (no
timeout) because ExecDdl() could take a very long time if a table has many
partitions, mostly waiting for the HMS API call.

Added a new RPC call WaitRpcResp() to wait longer for the receiver's
response. This is needed by certain RPCs; for example, for TransmitData()
sent by DataStreamSender, the receiver could hold the response due to
back pressure.

If an RPC call fails, we don't put the underlying connection back in the
cache but close it instead. This is to make sure the bad state of this
connection won't cause more RPC failures.

Besides the new EE test, I used the following iptables rules to inject
network failures and make sure RPC calls never hang.
1. Block network traffic on a port completely:
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network:
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/plan-fragment-executor.cc
M be/src/runtime/plan-fragment-executor.h
M be/src/service/fragment-exec-state.cc
M be/src/statestore/statestore.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
11 files changed, 206 insertions(+), 25 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/10
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 10
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-20 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 9:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/3343/9//COMMIT_MSG
Commit Message:

PS9, Line 9: configable
> configurable
Done


PS9, Line 21: keep default timeout to 0
> maybe mention in brackets that this means the timeouts are disabled.
Done


Line 24: 
> Also mention the WaitRpcResp()
Done


PS9, Line 25: we don't put it back to cache
> "we don't put the underlying connection back in the cache"
Done


http://gerrit.cloudera.org:8080/#/c/3343/9/be/src/runtime/client-cache.cc
File be/src/runtime/client-cache.cc:

Line 147:   shared_ptr client_impl;
> DCHECK(*client_key != NULL)
Done


http://gerrit.cloudera.org:8080/#/c/3343/9/be/src/runtime/data-stream-sender.cc
File be/src/runtime/data-stream-sender.cc:

PS9, Line 306: Status status = client.DoRpc(::TransmitData, params, );
  if (status.code() == TErrorCode::RPC_TIMEOUT) {
    status = client.WaitRpcResp(::recv_TransmitData, , runtime_state_, ONE_HOUR_IN_MS);
  }
  rpc_status_ = status;
> This code seems to be reused in many places. Maybe you can create a wrapper
Done


http://gerrit.cloudera.org:8080/#/c/3343/9/be/src/runtime/plan-fragment-executor.cc
File be/src/runtime/plan-fragment-executor.cc:

PS9, Line 456: Note
> "Note:"
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 9
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-20 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3343/9/be/src/runtime/data-stream-sender.cc
File be/src/runtime/data-stream-sender.cc:

PS9, Line 216: ONE_HOUR_IN_MS
> How was one hour chosen? I'm worried this might cause lot of queries to han
Yes, that could happen, but we don't want a very short timeout here. An 
upstream operator sometimes could take a very long time, and I don't know how 
long that could be.
This is the last means of detecting the peer node's failure. In most cases, 
the destination node's failure will be detected by the statestore heartbeat 
and the query will be cancelled before hitting this timeout.

The RPC_TIMEOUT error is checked by IsTimeoutTException(), only if the error 
contains "EAGAIN (timed out)".
It happens only on TSocket::read().
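As described, the timeout classification amounts to a substring check on the exception message. A sketch of that check (the real IsTimeoutTException() is C++ inside Impala; this Python version is illustrative only):

```python
def is_timeout_exception(message):
    """An RPC error counts as RPC_TIMEOUT only when the underlying Thrift
    socket read failed with EAGAIN, i.e. the exception message contains
    the "EAGAIN (timed out)" marker; other errors are not retried as
    timeouts."""
    return "EAGAIN (timed out)" in message
```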


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 9
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-19 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#9).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

This patch adds a configurable timeout for all backend client
RPC calls to avoid query hang issues.

Impala doesn't set a socket send/recv timeout for the backend client, so
RPC calls will wait forever for data. In extreme cases of a bad network,
or when the destination host has a kernel panic, the sender will not get
a response and the RPC call will hang. Query hangs are hard to detect. If
the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the
query cannot be cancelled unless you restart the coordinator.

Added a send/recv timeout to all RPC calls to avoid query hangs, and fixed
a bug where the reporting thread does not quit even after the query is
cancelled. For the catalog client, keep the default timeout of 0 because
ExecDdl() could take a very long time if a table has many partitions,
mostly waiting for the HMS API call.

If an RPC call fails, we don't put the connection back in the cache but
close it instead. This is to make sure bad state in this connection won't
cause more RPC failures.

Besides the new EE test, I used the following iptables rules to inject
network failures and make sure RPC calls never hang.
1. Block network traffic on a port completely:
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network:
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/plan-fragment-executor.cc
M be/src/runtime/plan-fragment-executor.h
M be/src/service/fragment-exec-state.cc
M be/src/statestore/statestore.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
11 files changed, 200 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/9
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 9
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-17 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#8).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

This patch adds a configurable timeout for all backend client
RPC calls to avoid query hang issues.

Impala doesn't set a socket send/recv timeout for the backend client, so
RPC calls will wait forever for data. In extreme cases of a bad network,
or when the destination host has a kernel panic, the sender will not get
a response and the RPC call will hang. Query hangs are hard to detect. If
the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the
query cannot be cancelled unless you restart the coordinator.

Added a send/recv timeout to all RPC calls to avoid query hangs, and fixed
a bug where the reporting thread does not quit even after the query is
cancelled. For the catalog client, keep the default timeout of 0 because
ExecDdl() could take a very long time if a table has many partitions,
mostly waiting for the HMS API call.

If an RPC call fails, we don't put the connection back in the cache but
close it instead. This is to make sure bad state in this connection won't
cause more RPC failures.

Besides the new EE test, I used the following iptables rules to inject
network failures and make sure RPC calls never hang.
1. Block network traffic on a port completely:
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network:
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-mgr.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/plan-fragment-executor.cc
M be/src/runtime/plan-fragment-executor.h
M be/src/service/fragment-exec-state.cc
M be/src/statestore/statestore.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
12 files changed, 204 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/8
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 8
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-16 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#7).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

This patch adds a configurable timeout for all backend client
RPC calls to avoid query hang issues.

Impala doesn't set a socket send/recv timeout for the backend client, so
RPC calls will wait forever for data. In extreme cases of a bad network,
or when the destination host has a kernel panic, the sender will not get
a response and the RPC call will hang. Query hangs are hard to detect. If
the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the
query cannot be cancelled unless you restart the coordinator.

Added a send/recv timeout to all RPC calls to avoid query hangs, and fixed
a bug where the reporting thread does not quit even after the query is
cancelled. For the catalog client, keep the default timeout of 0 because
ExecDdl() could take a very long time if a table has many partitions,
mostly waiting for the HMS API call.

If an RPC call fails, we don't put the connection back in the cache but
close it instead. This is to make sure bad state in this connection won't
cause more RPC failures.

Besides the new EE test, I used the following iptables rules to inject
network failures and make sure RPC calls never hang.
1. Block network traffic on a port completely:
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network:
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-mgr.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/plan-fragment-executor.cc
M be/src/runtime/plan-fragment-executor.h
M be/src/service/fragment-exec-state.cc
M be/src/statestore/statestore.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
12 files changed, 205 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/7
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-16 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#6).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

This patch adds a configurable timeout for all backend client
RPC calls to avoid query hang issues.

Impala doesn't set a socket send/recv timeout for the backend client, so
RPC calls will wait forever for data. In extreme cases of a bad network,
or when the destination host has a kernel panic, the sender will not get
a response and the RPC call will hang. Query hangs are hard to detect. If
the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the
query cannot be cancelled unless you restart the coordinator.

Added a send/recv timeout to all RPC calls to avoid query hangs, and fixed
a bug where the reporting thread does not quit even after the query is
cancelled. For the catalog client, keep the default timeout of 0 because
ExecDdl() could take a very long time if a table has many partitions,
mostly waiting for the HMS API call.

If an RPC call fails, we don't put the connection back in the cache but
close it instead. This is to make sure bad state in this connection won't
cause more RPC failures.

Besides the new EE test, I used the following iptables rules to inject
network failures and make sure RPC calls never hang.
1. Block network traffic on a port completely:
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network:
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-mgr.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/plan-fragment-executor.cc
M be/src/runtime/plan-fragment-executor.h
M be/src/service/fragment-exec-state.cc
M be/src/statestore/statestore.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
12 files changed, 203 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/6
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 6
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-14 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/3343/5/be/src/runtime/backend-client.h
File be/src/runtime/backend-client.h:

PS5, Line 48: // Consider back pressure, we should wait for receiver side as 
long as possible
: // but not forever. Also we need to make sure Cancel() can 
stop the waiting.
> I preferred the old way where you could set per-RPC timeouts. You can combi
I have the same concern as Sailesh: there might be a performance impact if we
set the timeout per RPC, and maintaining a different timeout for each type of
RPC could be hard.
I agree that moving the retry to the client is better.


http://gerrit.cloudera.org:8080/#/c/3343/5/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS5, Line 240: Reopen()
This could cause lots of issues. Without checking the exception details, we
don't know whether the exception came from socket send() or recv(); blindly
retrying in both cases could send the request multiple times. For example,
ExecutePlanFragment() could be called twice for the same fragment, the same
row batch could be sent twice, etc.


http://gerrit.cloudera.org:8080/#/c/3343/5/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

PS5, Line 1490: // Try to send the RPC 3 times before failing.
  : for (int i = 0; i < 3; ++i) {
> This always sends the RPC three times if it succeeds every time.
You're right, I should check whether the status is OK.


PS5, Line 1494: if (rpc_status.code() == TErrorCode::RPC_TIMEOUT) break;
> If you abort after a timeout error, what's the rationale for sending the RP
IsTimeoutTException() checks for a specific timeout that can only happen at
socket recv(), which means send() has already completed. The remote node got
the cancel request, so there is no need to retry.
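The corrected loop (check the status after each attempt; stop on success or on
a recv-side timeout) can be sketched as follows. `Status`, `ErrorCode`, and
`RpcWithRetry` are illustrative stand-ins, not Impala's actual types:

```cpp
#include <functional>

// Illustrative stand-ins for Impala's Status/TErrorCode (not the real types).
enum class ErrorCode { OK, RPC_TIMEOUT, RPC_GENERAL_ERROR };

struct Status {
  ErrorCode code = ErrorCode::OK;
  bool ok() const { return code == ErrorCode::OK; }
};

// Try the RPC up to max_attempts times, but:
//  - stop as soon as an attempt succeeds (don't resend a successful RPC);
//  - stop on a recv()-side timeout, because send() already completed and
//    resending could execute the request twice on the remote node.
Status RpcWithRetry(const std::function<Status()>& rpc, int max_attempts) {
  Status status;
  for (int i = 0; i < max_attempts; ++i) {
    status = rpc();
    if (status.ok()) break;
    if (status.code == ErrorCode::RPC_TIMEOUT) break;
  }
  return status;
}
```

The original code retried unconditionally, so a successful RPC was still sent
three times; the two `break`s above capture both review comments.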


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-13 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#5).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

This patch adds a configurable timeout to all backend client
RPC calls to avoid query hangs.

Impala doesn't set a socket send/recv timeout for the backend client,
so RPC calls will wait forever for data. In the extreme case of a bad
network, or if the destination host has a kernel panic, the sender will
not get a response and the RPC call will hang. A hung query is hard to
detect; if the hang happens in ExecRemoteFragment() or
CancelPlanFragments(), the query cannot be cancelled unless the
coordinator is restarted.

Added send/recv timeouts to all RPC calls to avoid query hangs, and
fixed a bug where the reporting thread did not quit even after the
query was cancelled. For the catalog client, keep the default timeout
of 0 because ExecDdl() can take a very long time if a table has many
partitions, mostly waiting on the HMS API call.

Besides the new EE test, I used the following iptables rules to
inject network failures and verify that RPC calls never hang.
1. Block network traffic on a port completely
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random 
--probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/runtime/backend-client.h
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/plan-fragment-executor.cc
M be/src/runtime/plan-fragment-executor.h
M be/src/service/fragment-exec-state.cc
M be/src/statestore/statestore.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
11 files changed, 164 insertions(+), 23 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/5
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-13 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 4:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/runtime/client-cache.cc
File be/src/runtime/client-cache.cc:

Line 125: void ClientCacheHelper::ReleaseClient(ClientKey* client_key) {
> I think it would be better to reset the timeouts to the default values when
Good idea.
I am still running tests to see whether we actually need different timeouts
for different RPC calls.


http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS4, Line 101: SetRpcTimeout
> Would this be clearer as SetClientRpcTimeout() here and below? On first loo
agreed


PS4, Line 239: SetRpcTimeout(send_timeout, recv_timeout);
> I'm a little worried about calling this every time we call a DoRpc() since 
You're right, setting the timeout per client makes sense.


http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/runtime/plan-fragment-executor.cc
File be/src/runtime/plan-fragment-executor.cc:

PS4, Line 556: is_cancelled_
> Can this also be IsCompleted()? Is it a possibility that there is a sequenc
StopReportThread() is called when the fragment completes, or when a cancel
happens during GetNext().
If an internal error happens at another time, it usually triggers internal
cancellation and Cancel() will be called.
But I agree, IsCompleted() seems a safer check here.


Line 559: UpdateStatus(Status(TErrorCode::CANCELLED, "Fragment already 
cancelled."));
> Should this be an error? Or is it better to just return since the user will
I don't think this should be treated as an error. The only issue is that the
report thread keeps running and sending status to the coordinator, which is
confusing.
I'll just call StopReportThread().


http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/service/fragment-exec-state.cc
File be/src/service/fragment-exec-state.cc:

PS4, Line 128: cause problem
> If we know exactly what sort of problems it could cause, it would be worth 
UpdateFragmentExecStatus() decrements the num_remaining_fragment_instances_
counter, so calling it multiple times for a completed fragment would trip a
DCHECK.
I'll add that to the comments.
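The hazard described above (a duplicate status report decrementing the counter
twice) can be sketched as follows; `FragmentTracker` and its names are
hypothetical, not Impala's actual coordinator code:

```cpp
#include <cassert>
#include <set>

// Illustrative sketch: the coordinator decrements a remaining-instances
// counter per "done" report, so the update must be made idempotent by
// ignoring reports for fragments that have already completed. Otherwise a
// retried RPC would decrement the counter twice and trip the DCHECK.
class FragmentTracker {
 public:
  explicit FragmentTracker(int num_instances) : remaining_(num_instances) {}

  // Returns false (and does nothing) for a duplicate "done" report.
  bool ReportDone(int fragment_id) {
    if (done_.count(fragment_id) > 0) return false;  // duplicate RPC: ignore
    done_.insert(fragment_id);
    assert(remaining_ > 0);  // the DCHECK a duplicate decrement would trip
    --remaining_;
    return true;
  }

  int remaining() const { return remaining_; }

 private:
  int remaining_;
  std::set<int> done_;  // fragment instances already reported done
};
```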


http://gerrit.cloudera.org:8080/#/c/3343/4/tests/custom_cluster/test_rpc_timeout.py
File tests/custom_cluster/test_rpc_timeout.py:

PS4, Line 39: assert "RPC" in str(ex)
> 'ex' should contain "RPC timed out" right? This would pass for "RPC Error" 
We used to catch only one specific timeout case and try to reopen the client
for all other errors. I wasn't sure that was the correct approach, so I
modified the exception handling. But now I see more RPC error cases than
before; maybe I should roll back the exception handling change.


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-09 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3343/1/be/src/runtime/exec-env.cc
File be/src/runtime/exec-env.cc:

PS1, Line 140: DEFINE_int32(catalog_client_send_timeout_ms, 0,
> will use a single one and will test if this affect large catalog, or a tabl
We do need a significantly larger timeout for the catalog connection.
When executing a DDL query on a table with many partitions, the HMS API call
can take a very long time: e.g. 10K partitions can take 100s+, and 100K
could take an hour.


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-09 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#4).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

Impala doesn't set a socket send/recv timeout, so RPC calls will wait
forever for data. In the extreme case of a network failure, or if the
destination host has a kernel panic, the sender will not get a response and
the RPC call will hang. A hung query is hard to detect; if the hang happens
in ExecRemoteFragment() or CancelPlanFragments(), the query cannot be
cancelled unless the coordinator is restarted.

Added send/recv timeouts to all RPC calls to avoid query hangs, and fixed
a bug where the reporting thread did not quit even after the query was
cancelled. For the catalog client, keep the default timeout of 0: ExecDdl()
can take a very long time if a table has many partitions, mostly waiting on
the HMS API call.

Besides the new EE test, I used the following iptables rules to
inject network failures and verify that RPC calls never hang.
1. Block network traffic on a port completely
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random 
--probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/exec-env.cc
M be/src/runtime/plan-fragment-executor.cc
M be/src/runtime/plan-fragment-executor.h
M be/src/service/fragment-exec-state.cc
M be/src/statestore/statestore.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
9 files changed, 147 insertions(+), 29 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/4
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-09 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#3).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

Impala doesn't set a socket send/recv timeout, so RPC calls will wait
forever for data. In the extreme case of a network failure, or if the
destination host has a kernel panic, the sender will not get a response and
the RPC call will hang. A hung query is hard to detect; if the hang happens
in ExecRemoteFragment() or CancelPlanFragments(), the query cannot be
cancelled unless the coordinator is restarted.

Added send/recv timeouts to all RPC calls to avoid query hangs, and fixed
a bug where the reporting thread did not quit even after the query was
cancelled.

Besides the new EE test, I used the following iptables rules to
inject network failures and verify that RPC calls never hang.
1. Block network traffic on a port completely
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random 
--probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/exec-env.cc
M be/src/runtime/plan-fragment-executor.cc
M be/src/runtime/plan-fragment-executor.h
M be/src/service/fragment-exec-state.cc
M be/src/statestore/statestore.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
9 files changed, 144 insertions(+), 29 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/3
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 3
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout

2016-06-09 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/3343

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
..

IMPALA-3575: Add retry to backend connection request and rpc timeout

Impala doesn't set a socket send/recv timeout, so RPC calls will wait
forever for data. In the extreme case of a network failure, or if the
destination host has a kernel panic, the sender will not get a response and
the RPC call will hang. A hung query is hard to detect; if the hang happens
in ExecRemoteFragment() or CancelPlanFragments(), the query cannot be
cancelled unless the coordinator is restarted.

Added send/recv timeouts to all RPC calls to avoid query hangs, and fixed
a bug where the reporting thread did not quit even after the query was
cancelled.

Besides the new EE test, I used the following iptables rules to
inject network failures and verify that RPC calls never hang.
1. Block network traffic on a port completely
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random 
--probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/exec-env.cc
M be/src/runtime/plan-fragment-executor.cc
M be/src/runtime/plan-fragment-executor.h
M be/src/service/fragment-exec-state.cc
M be/src/statestore/statestore.cc
M common/thrift/generate_error_codes.py
8 files changed, 99 insertions(+), 29 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/2
-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>


[Impala-CR](cdh5-2.5.0_5.7.x) IMPALA-3396: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.

2016-04-28 Thread Juan Yu (Code Review)
Juan Yu has submitted this change and it was merged.

Change subject: IMPALA-3396: Fix ConcurrentTimerCounter unit test 
"TimerCounterTest" failure.
..


IMPALA-3396: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.

On EC2 machines the timer error can be a little larger; increase
MAX_TIMER_ERROR_NS to 15ms to avoid flaky test failures.

Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f
Reviewed-on: http://gerrit.cloudera.org:8080/2838
Reviewed-by: Juan Yu <j...@cloudera.com>
Tested-by: Internal Jenkins
(cherry picked from commit 7797033f8c627d5df6ba477b78dec5347b7e6caa)
Reviewed-on: http://gerrit.cloudera.org:8080/2898
Tested-by: Juan Yu <j...@cloudera.com>
---
M be/src/util/runtime-profile-test.cc
1 file changed, 10 insertions(+), 10 deletions(-)

Approvals:
  Juan Yu: Looks good to me, approved; Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/2898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-2.5.0_5.7.x
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>


[Impala-CR](cdh5-2.5.0_5.7.x) IMPALA-3396: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.

2016-04-28 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3396: Fix ConcurrentTimerCounter unit test 
"TimerCounterTest" failure.
..


Patch Set 1: Code-Review+2 Verified+1

Clean cherry-pick, verified by
http://sandbox.jenkins.cloudera.com/view/Impala/view/Private-Utility/job/impala-private-build-and-test/3002/

-- 
To view, visit http://gerrit.cloudera.org:8080/2898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-2.5.0_5.7.x
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-HasComments: No


[Impala-CR](cdh5-2.5.0_5.7.x) IMPALA-2076: Correct execution time tracking for DataStreamSender.

2016-04-28 Thread Juan Yu (Code Review)
Juan Yu has submitted this change and it was merged.

Change subject: IMPALA-2076: Correct execution time tracking for 
DataStreamSender.
..


IMPALA-2076: Correct execution time tracking for DataStreamSender.

DataStreamSender uses multiple channels (each channel uses one thread)
to send data. We want to know the active wall-clock network time used by
DataStreamSender no matter how many threads are used, so we only count
concurrent sending time and row-batch serialization time. We don't want
to count recv_TransmitData() time because it mostly waits for the
receiver to deserialize and process the data, uses little network, and
those parts are already tracked by the Exchange node.
Added a ConcurrentStopWatch class to measure concurrent transmit time.
If a thread is already running, subsequent threads won't reset the
stopwatch, and the stopwatch stops only after all threads finish
their work.
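The stopwatch semantics described above can be sketched as follows; this is a
minimal illustrative version, not Impala's actual ConcurrentStopWatch:

```cpp
#include <chrono>
#include <cstdint>
#include <mutex>

// Accumulates wall-clock time only while at least one thread is "running",
// so overlapping sends by multiple channels are not double-counted: the
// first Start() starts the clock, and only the last Stop() stops it.
class ConcurrentStopWatch {
 public:
  void Start() {
    std::lock_guard<std::mutex> l(lock_);
    if (busy_threads_++ == 0) start_ = Clock::now();  // first thread starts the clock
  }

  void Stop() {
    std::lock_guard<std::mutex> l(lock_);
    if (--busy_threads_ == 0) {  // last thread stops the clock
      total_ns_ += std::chrono::duration_cast<std::chrono::nanoseconds>(
          Clock::now() - start_).count();
    }
  }

  int64_t TotalRunningTimeNs() const {
    std::lock_guard<std::mutex> l(lock_);
    return total_ns_;
  }

 private:
  using Clock = std::chrono::steady_clock;
  mutable std::mutex lock_;
  int busy_threads_ = 0;      // how many threads are currently sending
  Clock::time_point start_;   // when the current busy interval began
  int64_t total_ns_ = 0;      // accumulated concurrent running time
};
```

A second Start() while a thread is already running joins the existing
interval instead of resetting it, matching the commit message.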

Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Reviewed-on: http://gerrit.cloudera.org:8080/2578
Reviewed-by: Juan Yu <j...@cloudera.com>
Reviewed-by: Dan Hecht <dhe...@cloudera.com>
Tested-by: Internal Jenkins
(cherry picked from commit 18434bae383acb38c4a24e4a8d7484f6a173f354)
Reviewed-on: http://gerrit.cloudera.org:8080/2897
Tested-by: Juan Yu <j...@cloudera.com>
---
A be/src/runtime/backend-client.h
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/data-stream-sender.h
M be/src/runtime/data-stream-test.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-state.h
M be/src/service/fragment-exec-state.cc
M be/src/service/fragment-exec-state.h
M be/src/util/runtime-profile-test.cc
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M be/src/util/stopwatch.h
17 files changed, 371 insertions(+), 44 deletions(-)

Approvals:
  Juan Yu: Looks good to me, approved; Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/2897
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-2.5.0_5.7.x
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>


[Impala-CR](cdh5-2.5.0_5.7.x) IMPALA-2076: Correct execution time tracking for DataStreamSender.

2016-04-28 Thread Juan Yu (Code Review)
Hello Internal Jenkins, Dan Hecht,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/2897

to review the following change.

Change subject: IMPALA-2076: Correct execution time tracking for 
DataStreamSender.
..

IMPALA-2076: Correct execution time tracking for DataStreamSender.

DataStreamSender uses multiple channels (each channel uses one thread)
to send data. We want to know the active wall-clock network time used by
DataStreamSender no matter how many threads are used, so we only count
concurrent sending time and row-batch serialization time. We don't want
to count recv_TransmitData() time because it mostly waits for the
receiver to deserialize and process the data, uses little network, and
those parts are already tracked by the Exchange node.
Added a ConcurrentStopWatch class to measure concurrent transmit time.
If a thread is already running, subsequent threads won't reset the
stopwatch, and the stopwatch stops only after all threads finish
their work.

Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Reviewed-on: http://gerrit.cloudera.org:8080/2578
Reviewed-by: Juan Yu <j...@cloudera.com>
Reviewed-by: Dan Hecht <dhe...@cloudera.com>
Tested-by: Internal Jenkins
(cherry picked from commit 18434bae383acb38c4a24e4a8d7484f6a173f354)
---
A be/src/runtime/backend-client.h
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/data-stream-sender.h
M be/src/runtime/data-stream-test.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-state.h
M be/src/service/fragment-exec-state.cc
M be/src/service/fragment-exec-state.h
M be/src/util/runtime-profile-test.cc
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M be/src/util/stopwatch.h
17 files changed, 371 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/97/2897/1
-- 
To view, visit http://gerrit.cloudera.org:8080/2897
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-2.5.0_5.7.x
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins


[Impala-CR](cdh5-2.5.0_5.7.x) IMPALA-2076: Correct execution time tracking for DataStreamSender.

2016-04-28 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-2076: Correct execution time tracking for 
DataStreamSender.
..


Patch Set 1: Code-Review+2 Verified+1

Clean cherry-pick, verified by
http://sandbox.jenkins.cloudera.com/view/Impala/view/Private-Utility/job/impala-private-build-and-test/3002/

-- 
To view, visit http://gerrit.cloudera.org:8080/2897
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-2.5.0_5.7.x
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-HasComments: No


[Impala-CR](cdh5-2.5.0_5.7.x) IMPALA-3396: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.

2016-04-28 Thread Juan Yu (Code Review)
Hello Internal Jenkins,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/2898

to review the following change.

Change subject: IMPALA-3396: Fix ConcurrentTimerCounter unit test 
"TimerCounterTest" failure.
..

IMPALA-3396: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.

On EC2 machines the timer error can be a little larger; increase
MAX_TIMER_ERROR_NS to 15ms to avoid flaky test failures.

Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f
Reviewed-on: http://gerrit.cloudera.org:8080/2838
Reviewed-by: Juan Yu <j...@cloudera.com>
Tested-by: Internal Jenkins
(cherry picked from commit 7797033f8c627d5df6ba477b78dec5347b7e6caa)
---
M be/src/util/runtime-profile-test.cc
1 file changed, 10 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/98/2898/1
-- 
To view, visit http://gerrit.cloudera.org:8080/2898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-2.5.0_5.7.x
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins


Re: Impala query memory limit exceeded

2016-04-28 Thread Juan Yu
Besides adding the join condition, could you also check the scratch
directory permissions? The memory consumption is much lower than the mem
limit when it throws OOM; I suspect the spill is failing.
On Apr 27, 2016 10:05 PM, "Pradeep Nayak"  wrote:

> I will see if it works with specifying the join condition.
>
> On Wed, Apr 27, 2016 at 2:28 PM Tim Armstrong 
> wrote:
>
>> You need to specify a join condition. I.e. table_a join table_b on
>> table_a.key = table_2.key
>>
>>   table_a join table_b computes the cross-product of the two tables.
>>
>> You can confirm by running explain:
>>
>>   explain select count(*) from tab3_test_new join tab4_test_new
>>
>>
>> It will show a plan like this with a "CROSS JOIN" in it.
>>
>> [tarmstrong-box.ca.cloudera.com:21000] > explain select * from
>> functional.alltypes join functional.alltypes t2;
>> Query: explain select * from functional.alltypes join functional.alltypes
>> t2
>> +---+
>> | Explain String|
>> +---+
>> | Estimated Per-Host Requirements: Memory=320.68MB VCores=2 |
>> |   |
>> | 04:EXCHANGE [UNPARTITIONED]   |
>> | | |
>> | 02:NESTED LOOP JOIN [CROSS JOIN, BROADCAST]   |
>> | | |
>> | |--03:EXCHANGE [BROADCAST]|
>> | |  |  |
>> | |  01:SCAN HDFS [functional.alltypes t2]  |
>> | | partitions=24/24 files=24 size=478.45KB |
>> | | |
>> | 00:SCAN HDFS [functional.alltypes]|
>> |partitions=24/24 files=24 size=478.45KB|
>> +---+
>>
>>
>> On Wed, Apr 27, 2016 at 2:21 PM, Pradeep Nayak 
>> wrote:
>>
>>> 40 CPU and 156GB RAM per nodemanager.
>>> We are trying to do some validation on Impala. We are using a 5 node
>>> hadoop
>>> cluster m4.10x large machines on the aws. These were setup with cloudera
>>> manager. Each node: 40 cpus' and 156GB ram.
>>>
>>> We are trying to do a count(*) on two tables with 14 million rows each.
>>> When we do a count(*) individually on each of these tables, it works
>>> fine.
>>> However, when we try to join these two tables and do a count(*), we hit
>>> the problem below saying the memory limit was exceeded.
>>>
>>> Any inputs here are appreciated ? Should we need a bigger cluster for
>>> this
>>> ? How do we decide the size of the cluster.
>>>
>>>
>>> 
>>>
>>> [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab3_test_new;
>>>
>>> Query: select count(*) from tab3_test_new
>>>
>>> +--+
>>>
>>> | count(*) |
>>>
>>> +--+
>>>
>>> | 14216336 |
>>>
>>> +--+
>>>
>>> Fetched 1 row(s) in 8.29s
>>>
>>> [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab4_test_new;
>>>
>>> Query: select count(*) from tab4_test_new
>>>
>>> +--+
>>>
>>> | count(*) |
>>>
>>> +--+
>>>
>>> | 14987634 |
>>>
>>> +--+
>>>
>>> Fetched 1 row(s) in 11.46s
>>>
>>> [ip-172-30-1-57.ec2.internal:21000] >
>>>
>>> ip-172-30-1-57.ec2.internal:21000] > set mem_limit=64g;
>>>
>>> MEM_LIMIT set to 64g
>>>
>>> [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab3_test_new
>>> join tab4_test_new;
>>>
>>> Query: select count(*) from tab3_test_new join tab4_test_new
>>>
>>> WARNINGS:
>>>
>>> Memory Limit Exceeded
>>>
>>> Query(b44dde4886ccafd2:cfb6b9ffcdc4e3a6) Limit: Limit=64.00 GB
>>> Consumption=80.77 MB
>>>
>>>   Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a7: Consumption=12.00 KB
>>>
>>> AGGREGATION_NODE (id=6): Consumption=4.00 KB
>>>
>>> EXCHANGE_NODE (id=5): Consumption=0
>>>
>>> DataStreamRecvr: Consumption=0
>>>
>>>   Block Manager: Limit=156.00 MB Consumption=0
>>>
>>>   Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a8: Consumption=64.62 MB
>>>
>>> AGGREGATION_NODE (id=3): Consumption=4.00 KB
>>>
>>> NESTED_LOOP_JOIN_NODE (id=2): Consumption=64.06 MB
>>>
>>> HDFS_SCAN_NODE (id=0): Consumption=0
>>>
>>> EXCHANGE_NODE (id=4): Consumption=0
>>>
>>> DataStreamRecvr: Consumption=544.00 KB
>>>
>>> DataStreamSender: Consumption=16.00 KB
>>>
>>>   Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a9: Consumption=16.14 MB
>>>
>>> HDFS_SCAN_NODE (id=1): Consumption=16.01 MB
>>>
>>> DataStreamSender: Consumption=128.00 KB
>>>
>>> WARNING: The following tables are missing relevant table and/or column
>>> statistics.
>>>
>>> test_db.tab3_test_new,test_db.tab4_test_new
>>>
>>>
>>>
>>> [ip-172-30-1-57.ec2.internal:21000] >
>>>
>>

[Impala-CR](cdh5-trunk) IMPALA-2660: Respect auth_to_local configs from hdfs configs

2016-04-22 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-2660: Respect auth_to_local configs from hdfs configs
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2800/3/fe/src/main/java/com/cloudera/impala/authorization/AuthorizationChecker.java
File 
fe/src/main/java/com/cloudera/impala/authorization/AuthorizationChecker.java:

Line 201: }
> I don't think so, we are re-throwing the wrapped exception inside catch(), 
You're right; at this point, there must be an exception.


-- 
To view, visit http://gerrit.cloudera.org:8080/2800
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I76485b83c14ba26f6fce66e5f83e8014667829e0
Gerrit-PatchSet: 3
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3396: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.

2016-04-22 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3396: Fix ConcurrentTimerCounter unit test 
"TimerCounterTest" failure.
..


Patch Set 4: Code-Review+2

Added the JIRA number to the commit message. Carrying Tim's +2.

-- 
To view, visit http://gerrit.cloudera.org:8080/2838
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-2660: Respect auth_to_local configs from hdfs configs

2016-04-22 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-2660: Respect auth_to_local configs from hdfs configs
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2800/3/fe/src/main/java/com/cloudera/impala/authorization/AuthorizationChecker.java
File 
fe/src/main/java/com/cloudera/impala/authorization/AuthorizationChecker.java:

Line 201: }
don't we need to return false at the end?


-- 
To view, visit http://gerrit.cloudera.org:8080/2800
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I76485b83c14ba26f6fce66e5f83e8014667829e0
Gerrit-PatchSet: 3
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.

2016-04-22 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#3).

Change subject: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.
..

Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.

On EC2 machines, the timer error can be a little larger; increase
MAX_TIMER_ERROR_NS to 15 ms to avoid flaky test failures.

Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f
---
M be/src/util/runtime-profile-test.cc
1 file changed, 10 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/38/2838/3
-- 
To view, visit http://gerrit.cloudera.org:8080/2838
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f
Gerrit-PatchSet: 3
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>


[Impala-CR](cdh5-trunk) Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.

2016-04-22 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2838/2/be/src/util/runtime-profile-test.cc
File be/src/util/runtime-profile-test.cc:

Line 614:   if (!(stopwatch_value >= expected_value - TimerCounterTest::MAX_TIMER_ERROR_NS &&
> If you use EXPECT_LE and EXPECT_GE to check the conditions it will log the 
How could I forget this? Thanks, Tim.


-- 
To view, visit http://gerrit.cloudera.org:8080/2838
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.

2016-04-21 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2838/2/be/src/util/runtime-profile-test.cc
File be/src/util/runtime-profile-test.cc:

Line 614:   if (!(stopwatch_value >= expected_value - TimerCounterTest::MAX_TIMER_ERROR_NS &&
Log the timer error if it is larger than TimerCounterTest::MAX_TIMER_ERROR_NS, 
so that if the test fails again we can check whether it's in a valid range.
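The pattern being discussed (per-bound checks that log the actual error, instead of one opaque `if (!(...))`) can be sketched as follows; the function and parameter names here are illustrative, not Impala's actual test code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Sketch: check each side of the tolerance window separately so a failure
// reports which bound was violated and by how much.
bool CheckTimerWithinError(int64_t stopwatch_value_ns, int64_t expected_value_ns,
                           int64_t max_error_ns) {
  if (stopwatch_value_ns < expected_value_ns - max_error_ns) {
    std::fprintf(stderr, "timer low by %lld ns\n",
        static_cast<long long>(expected_value_ns - max_error_ns - stopwatch_value_ns));
    return false;
  }
  if (stopwatch_value_ns > expected_value_ns + max_error_ns) {
    std::fprintf(stderr, "timer high by %lld ns\n",
        static_cast<long long>(stopwatch_value_ns - expected_value_ns - max_error_ns));
    return false;
  }
  return true;
}
```

In a gtest-based test the same effect comes from `EXPECT_GE`/`EXPECT_LE`, which print both operands on failure.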


-- 
To view, visit http://gerrit.cloudera.org:8080/2838
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.

2016-04-21 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#2).

Change subject: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.
..

Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.

On EC2 machines, the timer error can be a little larger; increase
MAX_TIMER_ERROR_NS to 15 ms to avoid flaky test failures.

Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f
---
M be/src/util/runtime-profile-test.cc
1 file changed, 12 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/38/2838/2
-- 
To view, visit http://gerrit.cloudera.org:8080/2838
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>


[Impala-CR](cdh5-trunk) Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.

2016-04-21 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/2838

Change subject: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.
..

Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.

On EC2 machines, the timer error can be a little larger; increase
MAX_TIMER_ERROR_NS to 15 ms to avoid flaky test failures.

Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f
---
M be/src/util/runtime-profile-test.cc
1 file changed, 10 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/38/2838/1
-- 
To view, visit http://gerrit.cloudera.org:8080/2838
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3382: Long warnings cause impala-shell to stall

2016-04-21 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3382: Long warnings cause impala-shell to stall
..


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/2821/1//COMMIT_MSG
Commit Message:

Line 7: stall
stale?

Could you add some explanation of why the long warnings cause the shell to stall?


http://gerrit.cloudera.org:8080/#/c/2821/1/shell/impala_client.py
File shell/impala_client.py:

Line 459: 1
use a constant for this


Line 460: return "WARNINGS: %s \n(truncated, please use --display_full_warning to display more)" % log[0:1]
long line


http://gerrit.cloudera.org:8080/#/c/2821/1/shell/option_parser.py
File shell/option_parser.py:

Line 111: store_true
We want the option to default to false, right?


-- 
To view, visit http://gerrit.cloudera.org:8080/2821
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5d4ea7fc3e82f11151c170a9835ca9fcbbb3a94a
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Marcell Szabo <sz...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.

2016-04-20 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-2076: Correct execution time tracking for 
DataStreamSender.
..


Patch Set 8: Code-Review+1

Carry previous +1

-- 
To view, visit http://gerrit.cloudera.org:8080/2578
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 8
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.

2016-04-20 Thread Juan Yu (Code Review)
Hello Henry Robinson,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2578

to look at the new patch set (#8).

Change subject: IMPALA-2076: Correct execution time tracking for 
DataStreamSender.
..

IMPALA-2076: Correct execution time tracking for DataStreamSender.

DataStreamSender uses multiple channels (each channel uses one thread)
to send data. We want to know the active wall-clock network time used by
DataStreamSender no matter how many threads are used, so we only count
concurrent sending time and row batch serialization time. We don't
count recv_TransmitData() time because it mainly waits for the receiver
to deserialize and process data, uses little network, and that part is
already tracked by the Exchange node.
Added a ConcurrentStopWatch class to measure concurrent transmit time:
if a thread is already running, subsequent threads won't reset the
stop watch, and the stop watch is stopped only after all threads finish
their work.
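The ConcurrentStopWatch behavior described above (first thread starts the watch, last thread stops it, overlapping work counted once) can be sketched like this; the class and method names are assumed for illustration and are not Impala's actual implementation:

```cpp
#include <chrono>
#include <cstdint>
#include <mutex>

// Sketch: wall-clock time during which at least one thread is active.
// Overlapping threads share one running interval, so concurrent sends
// are counted once rather than summed per thread.
class ConcurrentStopWatch {
 public:
  void Start() {
    std::lock_guard<std::mutex> l(lock_);
    // Only the first thread (0 -> 1) starts the watch; later threads
    // joining while it runs don't reset it.
    if (busy_threads_++ == 0) start_ = Clock::now();
  }

  void Stop() {
    std::lock_guard<std::mutex> l(lock_);
    // Only the last thread (1 -> 0) stops the watch and accumulates time.
    if (--busy_threads_ == 0) {
      total_ns_ += std::chrono::duration_cast<std::chrono::nanoseconds>(
          Clock::now() - start_).count();
    }
  }

  int64_t TotalRunningTimeNs() {
    std::lock_guard<std::mutex> l(lock_);
    return total_ns_;
  }

 private:
  using Clock = std::chrono::steady_clock;
  std::mutex lock_;          // protects busy_threads_, start_, total_ns_
  int busy_threads_ = 0;
  Clock::time_point start_;
  int64_t total_ns_ = 0;
};
```

With two overlapping callers, the second `Start()`/first `Stop()` pair leaves the watch running; only the final `Stop()` accumulates the shared interval.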

Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
---
A be/src/runtime/backend-client.h
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/data-stream-sender.h
M be/src/runtime/data-stream-test.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-state.h
M be/src/service/fragment-exec-state.cc
M be/src/service/fragment-exec-state.h
M be/src/util/runtime-profile-test.cc
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M be/src/util/stopwatch.h
17 files changed, 371 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/78/2578/8
-- 
To view, visit http://gerrit.cloudera.org:8080/2578
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 8
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.

2016-04-20 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-2076: Correct execution time tracking for 
DataStreamSender.
..


Patch Set 6:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/2578/6//COMMIT_MSG
Commit Message:

Line 10: time
> network time
Done


http://gerrit.cloudera.org:8080/#/c/2578/6/be/src/runtime/backend-client.h
File be/src/runtime/backend-client.h:

Line 46: recv_TransmitData
> The commit message wasn't clear to me, this is much clearer, thanks. How ab
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2578
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 6
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.

2016-04-20 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-2076: Correct execution time tracking for 
DataStreamSender.
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2578/6/be/src/runtime/backend-client.h
File be/src/runtime/backend-client.h:

Line 40: if (transmit_csw_ != NULL) {
> Oh yeah, I forgot we overload TransmitData with the eos.  Okay, then you ca
Yeah, we should add a separate one for eos; then TransmitData will be just 
for network data sending, and we should always track the time.


-- 
To view, visit http://gerrit.cloudera.org:8080/2578
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 6
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.

2016-04-20 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-2076: Correct execution time tracking for 
DataStreamSender.
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2578/6/be/src/runtime/backend-client.h
File be/src/runtime/backend-client.h:

Line 40: if (transmit_csw_ != NULL) {
> Do you mean the SendCurrentBatch() there?
FYI, the one in the unit test is also for sending the eos flag, and it's for 
the receiver test, not related to DataStreamSender at all.
The ImpalaBackendClient thrift send/recv functions can be used for 
different purposes; I think we should let the caller decide whether to track 
time or not.


-- 
To view, visit http://gerrit.cloudera.org:8080/2578
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 6
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.

2016-04-20 Thread Juan Yu (Code Review)
Hello Henry Robinson,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2578

to look at the new patch set (#7).

Change subject: IMPALA-2076: Correct execution time tracking for 
DataStreamSender.
..

IMPALA-2076: Correct execution time tracking for DataStreamSender.

DataStreamSender uses multiple channels (each channel uses one thread)
to send data. We want to know the active wall-clock time used by
DataStreamSender no matter how many threads are used, so we only count
concurrent sending time and row batch serialization time. We don't
count recv_TransmitData() time because it mainly waits for the receiver
to deserialize and process data, uses little network, and that part is
already tracked by the Exchange node.
Added a ConcurrentStopWatch class to measure concurrent transmit time:
if a thread is already running, subsequent threads won't reset the
stop watch, and the stop watch is stopped only after all threads finish
their work.

Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
---
A be/src/runtime/backend-client.h
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/data-stream-sender.h
M be/src/runtime/data-stream-test.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-state.h
M be/src/service/fragment-exec-state.cc
M be/src/service/fragment-exec-state.h
M be/src/util/runtime-profile-test.cc
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M be/src/util/stopwatch.h
17 files changed, 370 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/78/2578/7
-- 
To view, visit http://gerrit.cloudera.org:8080/2578
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.

2016-04-20 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-2076: Correct execution time tracking for 
DataStreamSender.
..


Patch Set 6:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/2578/6/be/src/runtime/backend-client.h
File be/src/runtime/backend-client.h:

Line 40: if (transmit_csw_ != NULL) {
> Are there any cases where we want to do TransmitData() but not set the csw?
A couple of cases:
1. Status DataStreamSender::Channel::FlushAndSendEos(RuntimeState* state): 
this sends no data, just the eos flag.
2. The unit test; I could add a counter for the data stream test.


Line 46: recv_TransmitData
> why don't we care about recv time?
I explained this earlier; it's also mentioned in the commit message.
We only want to count the send_TransmitData() timing because that's the main 
job of DataStreamSender and where the network time is spent. 
recv_TransmitData() mainly waits for the receiver to deserialize and process 
data, uses little network, and that part is already tracked by the Exchange 
node.


-- 
To view, visit http://gerrit.cloudera.org:8080/2578
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 6
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.

2016-04-19 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#6).

Change subject: IMPALA-2076: Correct execution time tracking for 
DataStreamSender.
..

IMPALA-2076: Correct execution time tracking for DataStreamSender.

DataStreamSender uses multiple channels (each channel uses one thread)
to send data. We want to know the active wall-clock time used by
DataStreamSender no matter how many threads are used, so we ignore
receiver waiting time and only count concurrent sending time and batch
serialization time.
Added a ConcurrentStopWatch class to measure concurrent transmit time:
if a thread is already running, subsequent threads won't reset the
stop watch, and the stop watch is stopped only after all threads finish
their work.

Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
---
A be/src/runtime/backend-client.h
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/data-stream-sender.h
M be/src/runtime/data-stream-test.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-state.h
M be/src/service/fragment-exec-state.cc
M be/src/service/fragment-exec-state.h
M be/src/util/runtime-profile-test.cc
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M be/src/util/stopwatch.h
17 files changed, 370 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/78/2578/6
-- 
To view, visit http://gerrit.cloudera.org:8080/2578
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 6
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.

2016-04-19 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-2076: Correct execution time tracking for 
DataStreamSender.
..


Patch Set 5:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/2578/5/be/src/runtime/backend-client.h
File be/src/runtime/backend-client.h:

Line 66: /// Common cache / connection types
> remove this comment (there's only one cache / connection type here etc.)
Done


http://gerrit.cloudera.org:8080/#/c/2578/5/be/src/runtime/data-stream-test.cc
File be/src/runtime/data-stream-test.cc:

Line 630:   client.DoRpc(::TransmitData, params, );
> nit: one line?
Done


http://gerrit.cloudera.org:8080/#/c/2578/5/be/src/runtime/exec-env.cc
File be/src/runtime/exec-env.cc:

Line 200: new ImpalaBackendClientCache(
> one line?
Done


http://gerrit.cloudera.org:8080/#/c/2578/5/be/src/service/fragment-exec-state.cc
File be/src/service/fragment-exec-state.cc:

Line 118: coord.DoRpc(::ReportExecStatus, params, );
> nit: one line?
Done


http://gerrit.cloudera.org:8080/#/c/2578/5/be/src/util/runtime-profile-test.cc
File be/src/util/runtime-profile-test.cc:

Line 541: : timercounter_(TUnit::TIME_NS) {}
> nit: one line?
Done


Line 548: NULL
> just for simplicity, why not assign this in the constructor? Then line 582 
Sorry, I don't get it. Do you mean 
workers_.push_back(DummyWorker(new thread(...)));?
How do I pass the DummyWorker to TimerCounterTest::Run()?


Line 605: 2000
> why did this change so much?
By comparing with the wall clock, the diff should be under 1 ms.
Done


Line 605: MAX_TIMER_ERROR
> add units to the variable name (ns?)
Done


Line 622: TimerCounterTest& timer
> this should be a pointer if it's going to have non-const methods called on 
Done


http://gerrit.cloudera.org:8080/#/c/2578/5/be/src/util/runtime-profile.h
File be/src/util/runtime-profile.h:

Line 382:   : Counter(unit) {}
> nit: one line?
Done


Line 392:  uint64_t LapTime() { return csw_.LapTime(); }
> needs a comment
Done


http://gerrit.cloudera.org:8080/#/c/2578/5/be/src/util/stopwatch.h
File be/src/util/stopwatch.h:

Line 182: /// Utility class to measure multiple threads concurrent running time.
> I think it's important to say *wall* time here, otherwise users might expec
Done


Line 207: and reset lap_time_
:   /// back to 0.
> can remove this bit of the comment: it's an implementation detail (the beha
Done


Line 228:   /// Lock with busy_threads_.
> Nit: put the lock above the thing it protects (easiest thing is to swap msw
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2578
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.

2016-04-15 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#5).

Change subject: IMPALA-2076: Correct execution time tracking for 
DataStreamSender.
..

IMPALA-2076: Correct execution time tracking for DataStreamSender.

DataStreamSender uses multiple channels (each channel uses one thread)
to send data. We want to know the active wall-clock time used by
DataStreamSender no matter how many threads are used, so we ignore
receiver waiting time and only count concurrent sending time and batch
serialization time.
Added a ConcurrentStopWatch class to measure concurrent transmit time:
if a thread is already running, subsequent threads won't reset the
stop watch, and the stop watch is stopped only after all threads finish
their work.

Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
---
A be/src/runtime/backend-client.h
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/data-stream-sender.h
M be/src/runtime/data-stream-test.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-state.h
M be/src/service/fragment-exec-state.cc
M be/src/service/fragment-exec-state.h
M be/src/util/runtime-profile-test.cc
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M be/src/util/stopwatch.h
17 files changed, 376 insertions(+), 39 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/78/2578/5
-- 
To view, visit http://gerrit.cloudera.org:8080/2578
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.

2016-04-15 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-2076: Correct execution time tracking for 
DataStreamSender.
..


Patch Set 4:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/2578/4/be/src/runtime/data-stream-sender.cc
File be/src/runtime/data-stream-sender.cc:

Line 209:   COUNTER_ADD(parent_->profile_->total_time_counter(),
> I'm not totally sure if this does the right thing - presumably you want to 
I don't want to add the wall-clock time of all sender threads together. The 
DataStreamSender total_time = total serialize time + thrift_transmit_timer_. I 
want to add the delta here so total_time gets updated frequently; the lap time 
is what I want. I'll clean up the LapTime() implementation as you suggested.


http://gerrit.cloudera.org:8080/#/c/2578/4/be/src/util/runtime-profile-test.cc
File be/src/util/runtime-profile-test.cc:

Line 550:   delete thread_handle;
> how does this work if thread_handle is NULL (see line 584)?
Done


Line 557: while(!worker->done) {
> space before (
Done


Line 577:   void StopWorkers(int index) {
> needs a comment. maybe call 'index' something like 'exclude_idx' and defaul
Done


Line 580: index >= 0
> how would index be < 0?
I'll add comments. StopWorkers() can stop a specific thread or all threads; -1 
means all threads.


Line 589:   static const int MAX_NUM_THREADS = 5;
> why have this, rather than just initialize workers_[] in StartWorkers?
Removed.


Line 608: void ValidateLapTime(TimerCounterTest& timer, int64_t expected_value) 
{
> missing const?
This one cannot be const; LapTime() updates a ConcurrentStopWatch member 
variable.


Line 638: int running_time = 0;
> rather than tracking running_time per sleep-period, it might be better to u
Thanks for pointing this out, Done.


Line 659:   tester.StartWorkers(2, 0);
> suggest adding a sleep before this to show that the time between working pe
Done


http://gerrit.cloudera.org:8080/#/c/2578/4/be/src/util/runtime-profile.h
File be/src/util/runtime-profile.h:

Line 379: If a thread is already running, the following thread won't reset the 
stop watch.
> this is confusing - what's the following thread?
Done


http://gerrit.cloudera.org:8080/#/c/2578/4/be/src/util/stopwatch.h
File be/src/util/stopwatch.h:

Line 208: /// Returns running time since last time LapTime() is called and 
reset lap_time_
:   /// back to 0.
> this doesn't seem to match the implementation. If we call LapTime() twice i
Sorry for the confusion. I do need lap time, for callers who want delta 
updates of the concurrent running time.
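The lap-time semantics being clarified above (each call returns the time accumulated since the previous call, so a caller can periodically add deltas to a profile counter) can be sketched deterministically; the class is a hypothetical stand-in, not Impala's ConcurrentStopWatch:

```cpp
#include <cstdint>

// Sketch of delta-style lap timing: total_ns_ grows monotonically, and
// LapTime() returns only the growth since the last LapTime() call.
class LapTimer {
 public:
  // Accumulate some measured running time (nanoseconds).
  void Add(int64_t ns) { total_ns_ += ns; }

  // Delta since the previous LapTime() call; the total itself is not reset.
  int64_t LapTime() {
    int64_t lap = total_ns_ - last_lap_total_ns_;
    last_lap_total_ns_ = total_ns_;
    return lap;
  }

 private:
  int64_t total_ns_ = 0;
  int64_t last_lap_total_ns_ = 0;
};
```

This is why calling LapTime() twice in a row yields zero the second time: the delta, not the running total, is returned.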


-- 
To view, visit http://gerrit.cloudera.org:8080/2578
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.

2016-04-14 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#4).

Change subject: IMPALA-2076: Correct execution time tracking for 
DataStreamSender.
..

IMPALA-2076: Correct execution time tracking for DataStreamSender.

DataStreamSender uses multiple channels (each channel uses one thread)
to send data. We want to know the active wall-clock time used by
DataStreamSender no matter how many threads are used, so we ignore
receiver waiting time and only count concurrent sending time and batch
serialization time.
Added a ConcurrentStopWatch class to measure concurrent transmit time:
if a thread is already running, subsequent threads won't reset the
stop watch, and the stop watch is stopped only after all threads finish
their work.

Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
---
A be/src/runtime/backend-client.h
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/data-stream-sender.h
M be/src/runtime/data-stream-test.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-state.h
M be/src/service/fragment-exec-state.cc
M be/src/service/fragment-exec-state.h
M be/src/util/runtime-profile-test.cc
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M be/src/util/stopwatch.h
17 files changed, 359 insertions(+), 39 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/78/2578/4
-- 
To view, visit http://gerrit.cloudera.org:8080/2578
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.

2016-04-14 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-2076: Correct execution time tracking for 
DataStreamSender.
..


Patch Set 3:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/2578/3/be/src/runtime/backend-client.h
File be/src/runtime/backend-client.h:

Line 15: #include "runtime/client-cache.h"
> Unused?
ImpalaBackendClient is defined in this header, so this include is kept.


Line 22: /// Backend Client that can take a 
RuntimeProfile::ConcurrentTimerCounter to track
   : /// transmit data time for multiple threads. If more than one threads 
send data
   : /// at the same time, the transmit time is count only once.
> Let's make this a bit more clear, e.g.:
Done


Line 27:   ImpalaBackendClient(boost::shared_ptr< 
::apache::thrift::protocol::TProtocol> prot)
> is this space needed?
Thanks for pointing it out. Yes, it's needed; without the space, the compiler
will complain:
In file included from /home/cdh52/Impala/be/src/runtime/exec-env.h:26:
/home/cdh52/Impala/be/src/runtime/backend-client.h:30:40: error: found '<::'
after a template name which forms the digraph '<:' (aka '[') and a ':',
did you mean '< ::'?
  ImpalaBackendClient(boost::shared_ptr<::apache::thrift::protocol::TProtocol> prot)
   ^~~
   < ::
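A minimal illustration of the digraph issue above (the namespace and class names are stand-ins, and std::shared_ptr replaces boost::shared_ptr; this is not the real Thrift-generated code). Under pre-C++11 lexing rules, `<:` is an alternative token for `[`, so `<::` at the start of a template argument fails to parse; the space after `<` keeps the token sequence well-formed on all compilers.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Stand-in for ::apache::thrift::protocol::TProtocol.
namespace apache { namespace thrift { namespace protocol {
struct TProtocol {};
}}}

struct BackendClientSketch {
  // Without the space, "shared_ptr<::apache..." lexes (pre-C++11) as the
  // digraph "<:" (i.e. '[') followed by ':', which does not parse.
  explicit BackendClientSketch(
      std::shared_ptr< ::apache::thrift::protocol::TProtocol> prot)
      : prot_(std::move(prot)) {}

  std::shared_ptr< ::apache::thrift::protocol::TProtocol> prot_;
};
```

C++11 later added a lexing exception so `<::` followed by `:` or `>` is treated as `<` then `::`, but the space remains the portable spelling.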


Line 30:   ImpalaBackendClient(boost::shared_ptr< 
::apache::thrift::protocol::TProtocol> iprot,
> blank line before this one
Done


Line 45: Caller who wants to track data transmit time need to set its own 
counter.
> Maybe: "Callers of TransmitData() should provide their own counter to measu
Done


Line 51:   /// ImpalaBackendClient is shared by multiple queries. It's caller's 
responsibility
> It's the caller's...
Done


Line 52:   /// to reset the counter after data transmition.
> transmission
Done


http://gerrit.cloudera.org:8080/#/c/2578/3/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

Line 350: class ImpalaBackendClient;
: typedef ClientCache ImpalaBackendClientCache;
: typedef ClientConnection ImpalaBackendConnection;
> Let's move these to the header that defines ImpalaBackendClient (and then y
Done


http://gerrit.cloudera.org:8080/#/c/2578/3/be/src/runtime/data-stream-sender.cc
File be/src/runtime/data-stream-sender.cc:

Line 383: ThriftTransmitTime
> We use (*) to indicate that a timer is the sum of several counter's wall ti
Done


http://gerrit.cloudera.org:8080/#/c/2578/3/be/src/util/stopwatch.h
File be/src/util/stopwatch.h:

Line 187: last_running_time_(0),
:   busy_threads_(0) {
:   }
> fit this more onto one line
Done


Line 190: ~ConcurrentStopWatch() {
: msw_.Stop();
:   }
> I don't think you need this, because ConcurrentStopWatch isn't supposed to 
Done


Line 214: LastRunningTime
> I think we would usually call this the LapTime() (like on a regular stopwat
Done


Line 226: /// This is for caller who just want the most recent running time..
> Can you define what this is more clearly?
Done


Line 233:   boost::mutex thread_counter_lock_;
> I think this is a case where a SpinLock would make some sense, since the cr
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2578
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 3
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage

2016-04-13 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage
..


Patch Set 15: Code-Review+2

Carry Dan's +2

-- 
To view, visit http://gerrit.cloudera.org:8080/2543
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
Gerrit-PatchSet: 15
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage

2016-04-13 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage
..


Patch Set 14:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2543/14/be/src/util/decompress-test.cc
File be/src/util/decompress-test.cc:

Line 306: 8mb
> STREAM_OUT_BUF_SIZE
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2543
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
Gerrit-PatchSet: 14
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage

2016-04-13 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage
..


Patch Set 14:

rebase, carry Dan's +2

-- 
To view, visit http://gerrit.cloudera.org:8080/2543
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
Gerrit-PatchSet: 14
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage

2016-04-13 Thread Juan Yu (Code Review)
Hello Skye Wanderman-Milne, Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2543

to look at the new patch set (#15).

Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage
..

IMPALA-3038: Add multistream gzip/bzip2 test coverage

Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
---
M be/src/testutil/gtest-util.h
M be/src/util/codec.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
4 files changed, 123 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/2543/15
-- 
To view, visit http://gerrit.cloudera.org:8080/2543
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
Gerrit-PatchSet: 15
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage

2016-04-13 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage
..


Patch Set 14: Code-Review+1

Update comments and trivial change. Carry forward +1

-- 
To view, visit http://gerrit.cloudera.org:8080/2543
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
Gerrit-PatchSet: 14
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage

2016-04-12 Thread Juan Yu (Code Review)
Hello Skye Wanderman-Milne,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2543

to look at the new patch set (#14).

Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage
..

IMPALA-3038: Add multistream gzip/bzip2 test coverage

Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
---
M be/src/testutil/gtest-util.h
M be/src/util/codec.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
4 files changed, 122 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/2543/14
-- 
To view, visit http://gerrit.cloudera.org:8080/2543
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
Gerrit-PatchSet: 14
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage

2016-04-12 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage
..


Patch Set 13:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/2543/13/be/src/util/decompress-test.cc
File be/src/util/decompress-test.cc:

Line 187: Input is NULL or input_len is 0
> the code checks compressed_length, not 'input' nor 'input_len'.  should the
Done


Line 232: .
> let's explicitly say: "We should get expected_stream_end == false but with 
Done


Line 252: 8MB
> which constant does this correspond to? can we make the other constants der
This is STREAM_OUT_BUF_SIZE defined in decompress.cc. I can move it to codec.h 
to expose it.


Line 258: rand
> can the seed be specified at the command line (to reproduce failures)?
See L366: the seed is printed so we can reproduce failures.
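The pattern described here, seeding from a random source but always logging the seed so a failing run can be replayed, might look like the sketch below (a generic illustration with an assumed helper name, not the actual test code):

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <ctime>

// Seed the RNG once per test run and log the seed; re-running with the same
// seed reproduces the exact same "random" inputs.
unsigned SeedAndLog(unsigned seed) {
  printf("rand seed: %u\n", seed);  // printed so failures can be reproduced
  srand(seed);
  return seed;
}
```

In practice the seed would come from something like time(nullptr), and a failure report would include the printed value so the run can be repeated deterministically.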


Line 273: some space
> is the "some space" why we do the - RAW_INPUT_SIZE?  Or is the choice of RA
Yes, it's why we do - RAW_INPUT_SIZE. Done


Line 291: 16
> summarize how this is chosen.
Done


Line 292: 32
> and this
Done


Line 294: 2 * 26
> and this
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2543
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
Gerrit-PatchSet: 13
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage

2016-04-12 Thread Juan Yu (Code Review)
Hello Skye Wanderman-Milne,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2543

to look at the new patch set (#13).

Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage
..

IMPALA-3038: Add multistream gzip/bzip2 test coverage

Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
---
M be/src/testutil/gtest-util.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
3 files changed, 109 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/2543/13
-- 
To view, visit http://gerrit.cloudera.org:8080/2543
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
Gerrit-PatchSet: 13
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage

2016-04-12 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage
..


Patch Set 13: Code-Review+1

Carry Skye's +1

-- 
To view, visit http://gerrit.cloudera.org:8080/2543
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
Gerrit-PatchSet: 13
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time for DataStreamSender.

2016-04-08 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-2076: Correct execution time for DataStreamSender.
..


Patch Set 2:

@Henry, thanks for the review and suggestions. I'll think about adding a new
counter type for the acquire/release semantics from UpdateSenderStats() and
try to get rid of the new DoRpc.
We only want to count the send_TransmitData() timing because that's the main
job of DataStreamSender and where the network traffic is spent.
recv_TransmitData() mainly waits for the receiver to deserialize and process
data, which uses little network, and those parts are already tracked by the
Exchange node. I'll update the commit message with more details.

-- 
To view, visit http://gerrit.cloudera.org:8080/2578
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: No


[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage

2016-04-07 Thread Juan Yu (Code Review)
Hello Skye Wanderman-Milne,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2543

to look at the new patch set (#12).

Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage
..

IMPALA-3038: Add multistream gzip/bzip2 test coverage

Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
---
M be/src/testutil/gtest-util.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
3 files changed, 109 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/2543/12
-- 
To view, visit http://gerrit.cloudera.org:8080/2543
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
Gerrit-PatchSet: 12
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage

2016-04-05 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2543/10/be/src/util/decompress.cc
File be/src/util/decompress.cc:

Line 95:   *stream_end = input_length == 0;
> This is for the special case that caller pass in a NULL input, only used in
On second thought: since this is an output parameter, shouldn't the
decompressor always set it based on all the info it has (in case the caller
forgets to initialize it)? Maybe set
*stream_end = false;
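The defensive pattern under discussion, unconditionally assigning an output parameter before any early return, can be sketched as follows (a hypothetical signature for illustration, not the real decompressor code):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decompress step: *stream_end is an output parameter, so it is
// assigned up front rather than relying on the caller to initialize it.
bool DecompressSketch(const uint8_t* input, int64_t input_length,
                      bool* stream_end) {
  *stream_end = false;  // always assign the output param first
  if (input == nullptr || input_length == 0) {
    *stream_end = true;  // nothing left to decompress
    return true;
  }
  // Real decompression would go here and set *stream_end when the
  // underlying library reports end-of-stream.
  return true;
}
```

This way a caller that forgets to initialize the flag still gets a well-defined value on every code path.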


-- 
To view, visit http://gerrit.cloudera.org:8080/2543
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
Gerrit-PatchSet: 10
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage

2016-04-05 Thread Juan Yu (Code Review)
Hello Skye Wanderman-Milne,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/2543

to look at the new patch set (#11).

Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage
..

IMPALA-3038: Add multistream gzip/bzip2 test coverage

Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
---
M be/src/testutil/gtest-util.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
3 files changed, 109 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/2543/11
-- 
To view, visit http://gerrit.cloudera.org:8080/2543
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
Gerrit-PatchSet: 11
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage

2016-04-05 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2543/10/be/src/util/decompress.cc
File be/src/util/decompress.cc:

Line 95:   *stream_end = input_length == 0;
> Are you sure this is always correct thing to do given the way HdfsTextScann
This is for the special case where the caller passes in a NULL input, which
is only used in a test case. I don't think the scanner will send NULL input
to the decompressor.
Maybe I should tweak the test case to not always check stream_end.


-- 
To view, visit http://gerrit.cloudera.org:8080/2543
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
Gerrit-PatchSet: 10
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time for DataStreamSender.

2016-04-01 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-2076: Correct execution time for DataStreamSender.
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2578/2/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

Line 249:   /// Similar RPC call wrapper as DoRpc() but separate request and 
response so we could
:   /// profile thrift transmit time properly.
> This is just for TransmitData().
I also used tcpdump to monitor network traffic; the same number of packets
is sent as without the change.


-- 
To view, visit http://gerrit.cloudera.org:8080/2578
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage

2016-03-31 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#10).

Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage
..

IMPALA-3038: Add multistream gzip/bzip2 test coverage

Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
---
M be/src/testutil/gtest-util.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
3 files changed, 107 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/2543/10
-- 
To view, visit http://gerrit.cloudera.org:8080/2543
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88
Gerrit-PatchSet: 10
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time for DataStreamSender.

2016-03-30 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: IMPALA-2076: Correct execution time for DataStreamSender.
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/2578/2/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

Line 249:   /// Similar RPC call wrapper as DoRpc() but separate request and 
response so we could
:   /// profile thrift transmit time properly.
> Are we okay with the increase in network traffic once we split this up?
This is just for TransmitData().
It is currently already two calls, so I don't think it increases network
traffic:
void ImpalaInternalServiceClient::TransmitData(TTransmitDataResult& _return,
    const TTransmitDataParams& params)
{
  send_TransmitData(params);
  recv_TransmitData(_return);
}
recv_TransmitData() is used to apply back pressure: if the upstream node
cannot consume data as fast as needed, it slows down the sender to avoid
data congestion.
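The timing split being discussed, making the two halves of the RPC separately so that only the send half sits under the transmit timer, can be sketched like this (stand-in callables instead of the generated Thrift client; names are assumed):

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Only the send half (serialized batch going over the wire) is timed; the
// recv half is receiver-side processing / back pressure and is tracked
// elsewhere (e.g. by the exchange node). Returns elapsed send time in ms.
long long TimedTransmitMs(const std::function<void()>& send_fn,
                          const std::function<void()>& recv_fn) {
  auto start = std::chrono::steady_clock::now();
  send_fn();  // counted: network transmit
  auto end = std::chrono::steady_clock::now();
  recv_fn();  // not counted: wait for receiver / back pressure
  return std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
      .count();
}
```

The key property is that a slow receiver inflates total call latency but not the reported transmit time.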


-- 
To view, visit http://gerrit.cloudera.org:8080/2578
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes


[Impala-CR](cdh5-trunk) PREVIEW IMPALA-2076: Part2 - Show DataStreamSender in ExecSummary under EXCHANGE node.

2016-03-25 Thread Juan Yu (Code Review)
Juan Yu has abandoned this change.

Change subject: PREVIEW IMPALA-2076: Part2 - Show DataStreamSender in 
ExecSummary under EXCHANGE node.
..


Abandoned

Will take a different approach to add non-plan-node profiles to ExecSummary.

-- 
To view, visit http://gerrit.cloudera.org:8080/2579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: abandon
Gerrit-Change-Id: Ia53206fdd02bd72294f86ab9dd0b71b58303d33e
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>


[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time for DataStreamSender.

2016-03-25 Thread Juan Yu (Code Review)
Juan Yu has uploaded a new patch set (#2).

Change subject: IMPALA-2076: Correct execution time for DataStreamSender.
..

IMPALA-2076: Correct execution time for DataStreamSender.

Ignore receiver waiting time and only count thrift sending time and
batch serialization time for DataStreamSender.

Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
---
M be/src/runtime/client-cache.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/data-stream-sender.h
3 files changed, 86 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/78/2578/2
-- 
To view, visit http://gerrit.cloudera.org:8080/2578
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>


[Impala-CR](cdh5-2.2.0_5.4.x) CDH-38160: Prevent destructing uninitialized avro schema pointers

2016-03-25 Thread Juan Yu (Code Review)
Juan Yu has posted comments on this change.

Change subject: CDH-38160: Prevent destructing uninitialized avro schema 
pointers
..


Patch Set 5: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/2582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I02ae0ba586887e4bb713b993edd208f9421dd1a8
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-2.2.0_5.4.x
Gerrit-Owner: Huaisi Xu <h...@cloudera.com>
Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-HasComments: No

