[jira] [Created] (IMPALA-9123) Detect DDL hangs in tests

2019-11-05 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-9123:
--

 Summary: Detect DDL hangs in tests
 Key: IMPALA-9123
 URL: https://issues.apache.org/jira/browse/IMPALA-9123
 Project: IMPALA
  Issue Type: Test
Reporter: Quanlong Huang


Currently, we detect query hangs in tests by using BeeswaxConnection's 
execute_async() and wait_for_finished_timeout() together.

E.g. 
[https://github.com/apache/impala/blob/3.3.0/tests/authorization/test_grant_revoke.py#L334-L335]
{code:python}
handle = self.client.execute_async("invalidate metadata")
assert self.client.wait_for_finished_timeout(handle, timeout=60)
{code}

However, execute_async() won't return while the DDL is still in the CREATED 
state, which is usually the case when a DDL hangs. See the implementation of 
the query() interface for the Beeswax protocol: 
https://github.com/apache/impala/blob/3.3.0/be/src/service/impala-beeswax-server.cc#L52
So wait_for_finished_timeout() never runs and the test gets stuck in 
execute_async().

We need to find an elegant way to detect DDL hangs and cancel DDLs that are 
stuck in the CREATED state.
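
One possible direction (a rough sketch only, not a committed design) is to issue 
execute_async() from a helper thread so the test can time out even when the call 
itself never returns. The helper name and timeout below are hypothetical:
{code:python}
import threading

def execute_async_with_timeout(client, stmt, timeout_s=60):
    """Run client.execute_async() in a worker thread so that a DDL stuck in
    CREATED state (where execute_async() itself never returns) fails the test
    instead of hanging it. 'client' is assumed to behave like
    BeeswaxConnection."""
    result = {}

    def worker():
        result['handle'] = client.execute_async(stmt)

    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()
    t.join(timeout_s)
    if 'handle' not in result:
        # The statement never left CREATED state; the caller can then try to
        # cancel it and fail the test instead of blocking forever.
        raise AssertionError("DDL still not running after %ss: %s"
                             % (timeout_s, stmt))
    return result['handle']
{code}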



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IMPALA-9124) Transparently retry queries that fail due to cluster membership changes

2019-11-05 Thread Sahil Takiar (Jira)
Sahil Takiar created IMPALA-9124:


 Summary: Transparently retry queries that fail due to cluster 
membership changes
 Key: IMPALA-9124
 URL: https://issues.apache.org/jira/browse/IMPALA-9124
 Project: IMPALA
  Issue Type: New Feature
  Components: Backend, Clients
Reporter: Sahil Takiar
Assignee: Sahil Takiar


Currently, if the Impala Coordinator or any Executors run into errors during 
query execution, Impala will fail the entire query. It would improve user 
experience to transparently retry the query for some transient, recoverable 
errors.

This JIRA focuses on retrying queries that would otherwise fail due to cluster 
membership changes. Specifically, it covers node failures that change the 
cluster membership (currently the Coordinator cancels all queries running on a 
node if it detects that the node is no longer part of the cluster) and node 
blacklisting (the Coordinator blacklists a node when it detects a problem with 
it, e.g. it can't execute RPCs against the node). It is not focused on retrying 
general errors (e.g. frontend errors, MemLimitExceeded exceptions, etc.).
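
As a rough illustration of the intended policy only (the real logic would live 
in the Coordinator, and the error markers below are hypothetical placeholders 
for whatever status the backend reports), the retry decision amounts to 
classifying a failure as a membership-change/blacklisting error versus a 
general error:
{code:python}
# Sketch of the retry policy, not the actual implementation.
RETRYABLE_MARKERS = (
    "no longer part of the cluster",  # membership change
    "blacklisted",                    # node blacklisted after RPC failures
)

def is_retryable(error_text):
    """Only transient cluster-membership failures qualify for a retry."""
    return any(m in error_text for m in RETRYABLE_MARKERS)

def run_with_transparent_retry(execute_fn, stmt, max_retries=1):
    """Re-run the statement if it failed for a retryable reason; general
    errors (frontend errors, MemLimitExceeded, ...) still fail the query."""
    attempt = 0
    while True:
        try:
            return execute_fn(stmt)
        except Exception as e:
            if attempt >= max_retries or not is_retryable(str(e)):
                raise
            attempt += 1
{code}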



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IMPALA-9125) Add general mechanism to find DataSink from other fragments

2019-11-05 Thread Tim Armstrong (Jira)
Tim Armstrong created IMPALA-9125:
-

 Summary: Add general mechanism to find DataSink from other 
fragments
 Key: IMPALA-9125
 URL: https://issues.apache.org/jira/browse/IMPALA-9125
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend
Reporter: Tim Armstrong
Assignee: Tim Armstrong


As a precursor to IMPALA-4224, we should add a mechanism for a finstance to 
discover the join build sink from another finstance.

We already have a related single-purpose mechanism in the coordinator to find 
PlanRootSink. We should generalise this to allow looking up the sink of any 
other finstance, and move it into QueryState.
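
One way to picture the generalised mechanism (sketched in Python purely for 
brevity; the real API would live in QueryState in the C++ backend, and all 
names below are hypothetical) is a per-query registry that finstances publish 
their sink to and that other finstances can block on:
{code:python}
import threading

class SinkRegistry:
    """Hypothetical per-query registry: a finstance registers its DataSink,
    and any other finstance can wait for it by finstance id."""
    def __init__(self):
        self._lock = threading.Condition()
        self._sinks = {}

    def register(self, finstance_id, sink):
        with self._lock:
            self._sinks[finstance_id] = sink
            self._lock.notify_all()

    def wait_for_sink(self, finstance_id, timeout_s=None):
        # Blocks until the producing finstance has registered its sink.
        with self._lock:
            self._lock.wait_for(lambda: finstance_id in self._sinks, timeout_s)
            return self._sinks.get(finstance_id)
{code}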



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IMPALA-9126) Cleanly separate build and probe state in hash join node

2019-11-05 Thread Tim Armstrong (Jira)
Tim Armstrong created IMPALA-9126:
-

 Summary: Cleanly separate build and probe state in hash join node
 Key: IMPALA-9126
 URL: https://issues.apache.org/jira/browse/IMPALA-9126
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend
Reporter: Tim Armstrong


As a precursor to IMPALA-4224, we should clean up the hash join implementation 
so that the build and probe state is better separated. The builder should not 
deal with probe-side data structures (like the probe streams that it allocates), 
and all accesses to the build-side data structures should go through as narrow 
an API as possible.

The nested loop join is already pretty clean.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IMPALA-9127) Clean up probe-side state machine in hash join

2019-11-05 Thread Tim Armstrong (Jira)
Tim Armstrong created IMPALA-9127:
-

 Summary: Clean up probe-side state machine in hash join
 Key: IMPALA-9127
 URL: https://issues.apache.org/jira/browse/IMPALA-9127
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend
Reporter: Tim Armstrong


There's an implicit state machine in the main loop in 
PartitionedHashJoinNode::GetNext():
https://github.com/apache/impala/blob/eea617b/be/src/exec/partitioned-hash-join-node.cc#L510

The state is implicitly defined based on the following conditions:
* !output_build_partitions_.empty() -> "outputting build rows after probing"
* builder_->null_aware_partition() == NULL -> "eos, because the null-aware 
partition is processed after all other partitions"
* null_probe_output_idx_ >= 0 -> "null probe rows being processed"
* output_null_aware_probe_rows_running_ -> "null-aware partition being 
processed"
* probe_batch_pos_ != -1 -> "processing probe batch"
* builder_->num_hash_partitions() != 0 -> "have active hash partitions that are 
being probed"
* spilled_partitions_.empty() -> "no more spilled partitions"

I think this would be a lot easier to follow if the state machine were explicit 
and documented. It would also make separating out the build side of a spilling 
hash join easier to get right.
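
A possible shape for the explicit version, sketched in Python purely for 
brevity (the real change would be a documented C++ enum in the join node); the 
state names are hypothetical and roughly mirror the conditions above:
{code:python}
from enum import Enum

class ProbeState(Enum):
    """Hypothetical explicit states corresponding to the implicit conditions
    in PartitionedHashJoinNode::GetNext()."""
    OUTPUTTING_BUILD_ROWS = 1          # !output_build_partitions_.empty()
    OUTPUTTING_NULL_PROBE_ROWS = 2     # null_probe_output_idx_ >= 0
    OUTPUTTING_NULL_AWARE_ROWS = 3     # output_null_aware_probe_rows_running_
    PROCESSING_PROBE_BATCH = 4         # probe_batch_pos_ != -1
    PROBING_HASH_PARTITIONS = 5        # builder_->num_hash_partitions() != 0
    PROCESSING_SPILLED_PARTITIONS = 6  # !spilled_partitions_.empty()
    EOS = 7                            # builder_->null_aware_partition() == NULL
{code}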



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IMPALA-9128) Improve debugging for slow sends in KrpcDataStreamSender

2019-11-05 Thread Tim Armstrong (Jira)
Tim Armstrong created IMPALA-9128:
-

 Summary: Improve debugging for slow sends in KrpcDataStreamSender
 Key: IMPALA-9128
 URL: https://issues.apache.org/jira/browse/IMPALA-9128
 Project: IMPALA
  Issue Type: Bug
  Components: Distributed Exec
Reporter: Tim Armstrong
Assignee: Tim Armstrong


I'm trying to debug a problem that appears to be caused by a slow RPC:
{noformat}
Fragment F00
  Instance 754fc21ba4744310:d58fd0420020 (host=x)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1/120.48 MB
- AverageThreadTokens: 1.00 (1.0)
- BloomFilterBytes: 0 B (0)
- InactiveTotalTime: 0ns (0)
- PeakMemoryUsage: 3.2 MiB (3337546)
- PeakReservation: 2.0 MiB (2097152)
- PeakUsedReservation: 0 B (0)
- PerHostPeakMemUsage: 6.7 MiB (6987376)
- RowsProduced: 7 (7)
- TotalNetworkReceiveTime: 0ns (0)
- TotalNetworkSendTime: 3.6m (215354065071)
- TotalStorageWaitTime: 4ms (4552708)
- TotalThreadsInvoluntaryContextSwitches: 2 (2)
- TotalThreadsTotalWallClockTime: 3.6m (215924079474)
  - TotalThreadsSysTime: 24ms (24386000)
  - TotalThreadsUserTime: 505ms (505714000)
- TotalThreadsVoluntaryContextSwitches: 3,623 (3623)
- TotalTime: 3.6m (215801961705)
Fragment Instance Lifecycle Event Timeline
  Prepare Finished: 1ms (1812344)
  Open Finished: 322ms (322905753)
  First Batch Produced: 447ms (447050377)
  First Batch Sent: 447ms (447054546)
  ExecInternal Finished: 3.6m (215802284852)
Buffer pool
  - AllocTime: 0ns (0)
  - CumulativeAllocationBytes: 0 B (0)
  - CumulativeAllocations: 0 (0)
  - InactiveTotalTime: 0ns (0)
  - PeakReservation: 0 B (0)
  - PeakUnpinnedBytes: 0 B (0)
  - PeakUsedReservation: 0 B (0)
  - ReadIoBytes: 0 B (0)
  - ReadIoOps: 0 (0)
  - ReadIoWaitTime: 0ns (0)
  - ReservationLimit: 0 B (0)
  - TotalTime: 0ns (0)
  - WriteIoBytes: 0 B (0)
  - WriteIoOps: 0 (0)
  - WriteIoWaitTime: 0ns (0)
Fragment Instance Lifecycle Timings
  - ExecTime: 3.6m (215479380267)
- ExecTreeExecTime: 124ms (124299400)
  - InactiveTotalTime: 0ns (0)
  - OpenTime: 321ms (321088906)
- ExecTreeOpenTime: 572.04us (572045)
  - PrepareTime: 1ms (1426412)
- ExecTreePrepareTime: 233.32us (233318)
  - TotalTime: 0ns (0)
KrpcDataStreamSender (dst_id=3)
  - EosSent: 58 (58)
  - InactiveTotalTime: 3.6m (215354085858)
  - PeakMemoryUsage: 464.4 KiB (475504)
  - RowsSent: 7 (7)
  - RpcFailure: 0 (0)
  - RpcRetry: 0 (0)
  - SerializeBatchTime: 99.87us (99867)
  - TotalBytesSent: 207 B (207)
  - TotalTime: 3.6m (215355336381)
  - UncompressedRowBatchSize: 267 B (267)

{noformat}

We should add some diagnostics that would allow us to figure out which RPCs are 
slow and whether there's a pattern in which host is the problem. E.g. we could 
log when an RPC takes longer than a configured threshold.

It may also be useful to include some stats about the wait time, e.g. a 
histogram of the wait times, so that we can see if it's an outlier or general 
slowness.
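
To make the idea concrete, here is a minimal sketch (in Python for brevity; the 
real diagnostics would go into KrpcDataStreamSender in the backend) of 
threshold-based logging plus a coarse histogram of per-RPC wait times. The 
threshold and bucket boundaries are arbitrary:
{code:python}
import bisect
import logging
import time
from collections import Counter

SLOW_RPC_THRESHOLD_S = 2.0                # would be a configurable flag
BUCKET_BOUNDS_S = [0.01, 0.1, 1, 10, 60]  # histogram bucket upper bounds
wait_time_histogram = Counter()

def record_rpc(dest_host, send_rpc):
    """Time one RPC, log it if it is slower than the threshold, and add the
    wait time to a histogram so outliers are distinguishable from general
    slowness."""
    start = time.monotonic()
    result = send_rpc()
    elapsed = time.monotonic() - start
    bucket = bisect.bisect_left(BUCKET_BOUNDS_S, elapsed)
    wait_time_histogram[bucket] += 1
    if elapsed > SLOW_RPC_THRESHOLD_S:
        logging.warning("Slow RPC to %s: %.1fs", dest_host, elapsed)
    return result
{code}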



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IMPALA-9129) Provide a way for negative tests to remove intentionally generated core dumps

2019-11-05 Thread David Knupp (Jira)
David Knupp created IMPALA-9129:
---

 Summary: Provide a way for negative tests to remove intentionally 
generated core dumps
 Key: IMPALA-9129
 URL: https://issues.apache.org/jira/browse/IMPALA-9129
 Project: IMPALA
  Issue Type: Improvement
  Components: Infrastructure
Reporter: David Knupp


Occasionally, tests (esp. custom cluster tests) will perform some action, 
expecting Impala to generate a core dump.

We should have a general way for such tests to delete the bogus core dumps; 
otherwise they can complicate/confuse later test triaging efforts.
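
One way this could look (a sketch only; the core-dump location and naming 
pattern below are assumptions that depend on the test environment) is a helper 
that snapshots existing core files before the negative test and removes only 
the ones the test newly generated:
{code:python}
import glob
import os

# Assumed location/pattern for core dumps; adjust to the environment.
CORE_GLOB = os.path.join(os.environ.get("IMPALA_HOME", "."), "core.*")

def snapshot_core_files(pattern=CORE_GLOB):
    """Record the core files that already exist before the test runs."""
    return set(glob.glob(pattern))

def remove_new_core_files(before, pattern=CORE_GLOB):
    """Delete only the core files generated since snapshot_core_files(),
    so intentionally triggered dumps don't confuse later triaging."""
    for path in set(glob.glob(pattern)) - before:
        try:
            os.remove(path)
        except OSError:
            pass  # e.g. removed concurrently or not writable

# Usage inside a negative test:
#   before = snapshot_core_files()
#   ... perform the action that is expected to crash an impalad ...
#   remove_new_core_files(before)
{code}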



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (IMPALA-8692) Gracefully fail complex type inserts

2019-11-05 Thread Abhishek Rawat (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Rawat resolved IMPALA-8692.

Fix Version/s: Impala 3.4.0
   Resolution: Fixed

> Gracefully fail complex type inserts
> 
>
> Key: IMPALA-8692
> URL: https://issues.apache.org/jira/browse/IMPALA-8692
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Reporter: Abhishek Rawat
>Assignee: Abhishek Rawat
>Priority: Blocker
>  Labels: analysis, crash, front-end, parquet
> Fix For: Impala 3.4.0
>
>
> Block such insert statements in the analysis phase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)