[jira] [Created] (IMPALA-9123) Detect DDL hangs in tests
Quanlong Huang created IMPALA-9123:
-----------------------------------

             Summary: Detect DDL hangs in tests
                 Key: IMPALA-9123
                 URL: https://issues.apache.org/jira/browse/IMPALA-9123
             Project: IMPALA
          Issue Type: Test
            Reporter: Quanlong Huang

Currently, we detect query hangs in tests by using execute_async() and wait_for_finished_timeout() of BeeswaxConnection together, e.g. https://github.com/apache/impala/blob/3.3.0/tests/authorization/test_grant_revoke.py#L334-L335

{code:python}
handle = self.client.execute_async("invalidate metadata")
assert self.client.wait_for_finished_timeout(handle, timeout=60)
{code}

However, execute_async() won't return while the DDL is in the CREATED state, which is usually the state a hung DDL is stuck in. See the implementation of the query() interface for the Beeswax protocol: https://github.com/apache/impala/blob/3.3.0/be/src/service/impala-beeswax-server.cc#L52

So wait_for_finished_timeout() never runs and the test is stuck in execute_async(). We need to find an elegant way to detect DDL hangs and cancel DDLs stuck in the CREATED state.
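One possible direction (a rough sketch only, not a decided design) is to move the blocking execute_async() call onto a worker thread, so a hang in query submission itself surfaces as a test failure instead of wedging the test. The wrapper name and timeout below are invented for illustration; only execute_async() and wait_for_finished_timeout() are existing client methods, and cancelling a DDL that never returned a handle is the part this JIRA still needs to solve.

{code:python}
import threading

def execute_async_with_timeout(client, stmt, submit_timeout_s=60):
    """Run client.execute_async(stmt) on a worker thread so that a DDL stuck
    in CREATED state shows up as a timeout instead of blocking the test."""
    result = {}

    def _submit():
        try:
            result["handle"] = client.execute_async(stmt)
        except Exception as e:
            result["error"] = e

    worker = threading.Thread(target=_submit)
    worker.daemon = True  # don't keep the test process alive if submission never returns
    worker.start()
    worker.join(submit_timeout_s)
    if worker.is_alive():
        raise AssertionError("execute_async(%r) did not return within %ss; "
                             "the DDL is likely stuck in CREATED state" % (stmt, submit_timeout_s))
    if "error" in result:
        raise result["error"]
    return result["handle"]

# Usage in a test, mirroring the snippet above:
# handle = execute_async_with_timeout(self.client, "invalidate metadata")
# assert self.client.wait_for_finished_timeout(handle, timeout=60)
{code}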
[jira] [Created] (IMPALA-9124) Transparently retry queries that fail due to cluster membership changes
Sahil Takiar created IMPALA-9124:
---------------------------------

             Summary: Transparently retry queries that fail due to cluster membership changes
                 Key: IMPALA-9124
                 URL: https://issues.apache.org/jira/browse/IMPALA-9124
             Project: IMPALA
          Issue Type: New Feature
          Components: Backend, Clients
            Reporter: Sahil Takiar
            Assignee: Sahil Takiar

Currently, if the Impala Coordinator or any Executors run into errors during query execution, Impala fails the entire query. It would improve the user experience to transparently retry the query for some transient, recoverable errors.

This JIRA focuses on retrying queries that would otherwise fail due to cluster membership changes. Specifically, it covers node failures that cause changes in the cluster membership (currently the Coordinator cancels all queries running on a node if it detects that the node is no longer part of the cluster) and node blacklisting (the Coordinator blacklists a node because it detects a problem with it, e.g. it can't execute RPCs against the node). It is not focused on retrying general errors (e.g. frontend errors, MemLimitExceeded exceptions, etc.).
[jira] [Created] (IMPALA-9125) Add general mechanism to find DataSink from other fragments
Tim Armstrong created IMPALA-9125:
----------------------------------

             Summary: Add general mechanism to find DataSink from other fragments
                 Key: IMPALA-9125
                 URL: https://issues.apache.org/jira/browse/IMPALA-9125
             Project: IMPALA
          Issue Type: Sub-task
          Components: Backend
            Reporter: Tim Armstrong
            Assignee: Tim Armstrong

As a precursor to IMPALA-4224, we should add a mechanism for a finstance to discover the join build sink of another finstance. We already have a related single-purpose mechanism in the coordinator to find the PlanRootSink. We should generalise this to allow looking up the sink of any other finstance and move it into QueryState.
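For illustration only, the lookup being described is essentially a rendezvous keyed by fragment instance id: the producing finstance publishes its sink, and consumers block until it appears. The sketch below is a toy Python model of that shape (the real change would live in the C++ QueryState; the class and method names here are invented):

{code:python}
import threading

class SinkRegistry:
    """Toy per-query registry letting one fragment instance find the
    DataSink (e.g. a join build sink) of another. Illustrative only."""

    def __init__(self):
        self._cv = threading.Condition()
        self._sinks = {}  # finstance id -> sink

    def publish(self, finstance_id, sink):
        # Called by the finstance that owns the sink once it is set up.
        with self._cv:
            self._sinks[finstance_id] = sink
            self._cv.notify_all()

    def wait_for(self, finstance_id, timeout_s=None):
        # Called by a consuming finstance; blocks until the sink is published.
        with self._cv:
            self._cv.wait_for(lambda: finstance_id in self._sinks, timeout=timeout_s)
            return self._sinks.get(finstance_id)
{code}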
[jira] [Created] (IMPALA-9126) Cleanly separate build and probe state in hash join node
Tim Armstrong created IMPALA-9126:
----------------------------------

             Summary: Cleanly separate build and probe state in hash join node
                 Key: IMPALA-9126
                 URL: https://issues.apache.org/jira/browse/IMPALA-9126
             Project: IMPALA
          Issue Type: Sub-task
          Components: Backend
            Reporter: Tim Armstrong

As a precursor to IMPALA-4224, we should clean up the hash join implementation so that the build and probe state are better separated. The builder should not deal with probe-side data structures (like the probe streams that it allocates), and all accesses to the build-side data structures should go through APIs that are as narrow as possible. The nested loop join is already pretty clean.
[jira] [Created] (IMPALA-9127) Clean up probe-side state machine in hash join
Tim Armstrong created IMPALA-9127:
----------------------------------

             Summary: Clean up probe-side state machine in hash join
                 Key: IMPALA-9127
                 URL: https://issues.apache.org/jira/browse/IMPALA-9127
             Project: IMPALA
          Issue Type: Sub-task
          Components: Backend
            Reporter: Tim Armstrong

There's an implicit state machine in the main loop in PartitionedHashJoinNode::GetNext(): https://github.com/apache/impala/blob/eea617b/be/src/exec/partitioned-hash-join-node.cc#L510

The state is implicitly defined based on the following conditions:
* !output_build_partitions_.empty() -> "outputting build rows after probing"
* builder_->null_aware_partition() == NULL -> "eos, because the null-aware partition is processed after all other partitions"
* null_probe_output_idx_ >= 0 -> "null probe rows being processed"
* output_null_aware_probe_rows_running_ -> "null-aware partition being processed"
* probe_batch_pos_ != -1 -> "processing probe batch"
* builder_->num_hash_partitions() != 0 -> "have active hash partitions that are being probed"
* spilled_partitions_.empty() -> "no more spilled partitions"

I think this would be a lot easier to follow if the state machine were explicit and documented, and it would make separating out the build side of a spilling hash join easier to get right. A rough sketch of what the explicit states might look like is given below.
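As a sketch only, the implicit conditions above might map onto an explicit set of states along these lines (the state names are invented here; the actual change would be an enum inside the C++ join node):

{code:python}
from enum import Enum, auto

class ProbeState(Enum):
    """Hypothetical explicit probe-side states for PartitionedHashJoinNode::GetNext()."""
    PROBING_IN_BATCH = auto()            # probe_batch_pos_ != -1: iterating over a probe batch
    PROBE_COMPLETE = auto()              # current probe input exhausted
    OUTPUTTING_UNMATCHED_BUILD = auto()  # !output_build_partitions_.empty()
    OUTPUTTING_NULL_PROBE = auto()       # null_probe_output_idx_ >= 0
    OUTPUTTING_NULL_AWARE = auto()       # output_null_aware_probe_rows_running_
    REPARTITIONING_SPILLED = auto()      # picking the next entry from spilled_partitions_
    EOS = auto()                         # builder_->null_aware_partition() == NULL
{code}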
[jira] [Created] (IMPALA-9128) Improve debugging for slow sends in KrpcDataStreamSender
Tim Armstrong created IMPALA-9128:
----------------------------------

             Summary: Improve debugging for slow sends in KrpcDataStreamSender
                 Key: IMPALA-9128
                 URL: https://issues.apache.org/jira/browse/IMPALA-9128
             Project: IMPALA
          Issue Type: Bug
          Components: Distributed Exec
            Reporter: Tim Armstrong
            Assignee: Tim Armstrong

I'm trying to debug a problem that appears to be caused by a slow RPC:
{noformat}
      Fragment F00
        Instance 754fc21ba4744310:d58fd0420020 (host=x)
          Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1/120.48 MB
           - AverageThreadTokens: 1.00 (1.0)
           - BloomFilterBytes: 0 B (0)
           - InactiveTotalTime: 0ns (0)
           - PeakMemoryUsage: 3.2 MiB (3337546)
           - PeakReservation: 2.0 MiB (2097152)
           - PeakUsedReservation: 0 B (0)
           - PerHostPeakMemUsage: 6.7 MiB (6987376)
           - RowsProduced: 7 (7)
           - TotalNetworkReceiveTime: 0ns (0)
           - TotalNetworkSendTime: 3.6m (215354065071)
           - TotalStorageWaitTime: 4ms (4552708)
           - TotalThreadsInvoluntaryContextSwitches: 2 (2)
           - TotalThreadsTotalWallClockTime: 3.6m (215924079474)
             - TotalThreadsSysTime: 24ms (24386000)
             - TotalThreadsUserTime: 505ms (505714000)
           - TotalThreadsVoluntaryContextSwitches: 3,623 (3623)
           - TotalTime: 3.6m (215801961705)
          Fragment Instance Lifecycle Event Timeline
            Prepare Finished: 1ms (1812344)
            Open Finished: 322ms (322905753)
            First Batch Produced: 447ms (447050377)
            First Batch Sent: 447ms (447054546)
            ExecInternal Finished: 3.6m (215802284852)
          Buffer pool
           - AllocTime: 0ns (0)
           - CumulativeAllocationBytes: 0 B (0)
           - CumulativeAllocations: 0 (0)
           - InactiveTotalTime: 0ns (0)
           - PeakReservation: 0 B (0)
           - PeakUnpinnedBytes: 0 B (0)
           - PeakUsedReservation: 0 B (0)
           - ReadIoBytes: 0 B (0)
           - ReadIoOps: 0 (0)
           - ReadIoWaitTime: 0ns (0)
           - ReservationLimit: 0 B (0)
           - TotalTime: 0ns (0)
           - WriteIoBytes: 0 B (0)
           - WriteIoOps: 0 (0)
           - WriteIoWaitTime: 0ns (0)
          Fragment Instance Lifecycle Timings
           - ExecTime: 3.6m (215479380267)
             - ExecTreeExecTime: 124ms (124299400)
           - InactiveTotalTime: 0ns (0)
           - OpenTime: 321ms (321088906)
             - ExecTreeOpenTime: 572.04us (572045)
           - PrepareTime: 1ms (1426412)
             - ExecTreePrepareTime: 233.32us (233318)
           - TotalTime: 0ns (0)
          KrpcDataStreamSender (dst_id=3)
           - EosSent: 58 (58)
           - InactiveTotalTime: 3.6m (215354085858)
           - PeakMemoryUsage: 464.4 KiB (475504)
           - RowsSent: 7 (7)
           - RpcFailure: 0 (0)
           - RpcRetry: 0 (0)
           - SerializeBatchTime: 99.87us (99867)
           - TotalBytesSent: 207 B (207)
           - TotalTime: 3.6m (215355336381)
           - UncompressedRowBatchSize: 267 B (267)
{noformat}

We should add some diagnostics that allow us to figure out which RPCs are slow and whether there's a pattern in which hosts are the problem. E.g. maybe we should log if the RPC time exceeds a configured threshold. It may also be useful to include some stats about the wait time, e.g. a histogram of the wait times, so that we can see whether it's an outlier or general slowness.
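A rough sketch of the kind of diagnostic being proposed (not Impala's actual KRPC code; the threshold and helper are invented for illustration, and a real implementation would live in the C++ sender): time each RPC, log a warning naming the destination host when the time exceeds a configurable threshold, and record the values per destination so outliers can be distinguished from general slowness.

{code:python}
import collections
import logging
import time

SLOW_RPC_THRESHOLD_S = 2.0  # illustrative; would be a configurable flag

# destination host -> observed RPC wait times (a real implementation would
# use a bounded histogram, not an unbounded list)
rpc_wait_times = collections.defaultdict(list)

def timed_rpc(dest_host, send_fn):
    """Wrap an RPC call, recording its latency and logging slow ones."""
    start = time.monotonic()
    try:
        return send_fn()
    finally:
        elapsed = time.monotonic() - start
        rpc_wait_times[dest_host].append(elapsed)
        if elapsed > SLOW_RPC_THRESHOLD_S:
            logging.warning("RPC to %s took %.1fs (threshold %.1fs)",
                            dest_host, elapsed, SLOW_RPC_THRESHOLD_S)
{code}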
[jira] [Created] (IMPALA-9129) Provide a way for negative tests to remove intentionally generated core dumps
David Knupp created IMPALA-9129:
--------------------------------

             Summary: Provide a way for negative tests to remove intentionally generated core dumps
                 Key: IMPALA-9129
                 URL: https://issues.apache.org/jira/browse/IMPALA-9129
             Project: IMPALA
          Issue Type: Improvement
          Components: Infrastructure
            Reporter: David Knupp

Occasionally, tests (especially custom cluster tests) will perform some action expecting Impala to generate a core dump. We should have a general way for such tests to delete the bogus core dumps, otherwise they can complicate/confuse later test triaging efforts.
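One possible shape for this in the Python test framework (a sketch under assumptions: the core-dump location and fixture name are invented, not existing infrastructure) is a fixture that snapshots existing core files before the test and deletes any new ones afterwards:

{code:python}
import glob
import os
import pytest

# Assumed location where daemon core files land; the real path would come
# from the test environment configuration.
CORE_DUMP_DIR = os.environ.get("IMPALA_HOME", ".")

@pytest.fixture
def cleanup_expected_core_dumps():
    """For tests that intentionally crash a daemon: remember which core files
    already exist, then delete any new ones so they don't confuse later triage."""
    pre_existing = set(glob.glob(os.path.join(CORE_DUMP_DIR, "core*")))
    yield
    for core in set(glob.glob(os.path.join(CORE_DUMP_DIR, "core*"))) - pre_existing:
        try:
            os.remove(core)
        except OSError:
            pass  # best effort cleanup
{code}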
[jira] [Resolved] (IMPALA-8692) Gracefully fail complex type inserts
[ https://issues.apache.org/jira/browse/IMPALA-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Rawat resolved IMPALA-8692.
------------------------------------
    Fix Version/s: Impala 3.4.0
       Resolution: Fixed

> Gracefully fail complex type inserts
>
>
>                 Key: IMPALA-8692
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8692
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Abhishek Rawat
>            Assignee: Abhishek Rawat
>            Priority: Blocker
>              Labels: analysis, crash, front-end, parquet
>             Fix For: Impala 3.4.0
>
>
> Block such insert statements in the analysis phase.