[jira] [Created] (IMPALA-13264) bin/coverage_helper.sh should always use gcov from the toolchain
Joe McDonnell created IMPALA-13264: -- Summary: bin/coverage_helper.sh should always use gcov from the toolchain Key: IMPALA-13264 URL: https://issues.apache.org/jira/browse/IMPALA-13264 Project: IMPALA Issue Type: Task Components: Infrastructure Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell bin/coverage_helper.sh gets gcov from the toolchain if it is not installed on the system.
{noformat}
if ! which gcov > /dev/null; then
  export PATH="$PATH:$IMPALA_TOOLCHAIN_PACKAGES_HOME/gcc-$IMPALA_GCC_VERSION/bin"
fi
echo "Using gcov at `which gcov`"
{noformat}
Since the toolchain compiler can be different from the system compiler, I think it makes more sense to always use gcov from the toolchain's GCC. Then the gcov version will always match the GCC version.
[jira] [Commented] (IMPALA-13253) Add option to use TCP keepalives for client connections
[ https://issues.apache.org/jira/browse/IMPALA-13253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17869039#comment-17869039 ] Joe McDonnell commented on IMPALA-13253: The AWS LB has an idle time limit of 350 seconds that does not explicitly notify either end that the connection is dead: [https://aws.amazon.com/blogs/networking-and-content-delivery/introducing-configurable-idle-timeout-for-connection-tracking/] The libkeepalive library can be used to force a program to use TCP keepalive without needing to recompile it: [https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/addsupport.html] Testing using libkeepalive and iptables shows that it behaves as expected: It can handle situations where packets are dropped or rejected. In a cluster that uses the AWS LB, this can be set to have a keepalive time of 400 seconds to very quickly detect and close connections that AWS LB considers idle. I think keepalive should be on by default. > Add option to use TCP keepalives for client connections > --- > > Key: IMPALA-13253 > URL: https://issues.apache.org/jira/browse/IMPALA-13253 > Project: IMPALA > Issue Type: Task > Components: Backend, Clients >Affects Versions: Impala 4.5.0 >Reporter: Joe McDonnell >Priority: Blocker > > A client can be disconnected without explicitly closing its TCP connection. > This can happen if the client machine resets or there is a network > disruption. In particular, load balancers can have an idle time that results > in a connection becoming invalid. Impala can't really guarantee that the > client will properly tear down its connection and the Impala side resources > will be released. > TCP keepalive would allow Impala to detect dead clients and close the > connection. It also can prevent a load balancer from seeing the connection as > idle. This can be important for clients that hold connections in a pool.
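To make the mechanism concrete, here is a minimal Python sketch of the TCP keepalive settings involved; this is the kind of configuration libkeepalive injects via LD_PRELOAD, while Impala's own change would set the same options in the backend. The timing values are illustrative assumptions keyed to the 350-second AWS LB limit described above.
{code:python}
import socket

# Create a socket and enable TCP keepalive on it. With these (illustrative)
# settings, a connection idle for 400 seconds starts receiving keepalive
# probes; if the peer (or an LB that dropped its conntrack state) never
# answers 4 probes sent 30 seconds apart, the kernel reports the connection
# as dead and blocked reads/writes fail.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# The per-connection tuning knobs below are Linux-specific.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 400)  # idle seconds before first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)  # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 4)     # unanswered probes before giving up
{code}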
[jira] [Updated] (IMPALA-13253) Add option to use TCP keepalives for client connections
[ https://issues.apache.org/jira/browse/IMPALA-13253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell updated IMPALA-13253: --- Priority: Blocker (was: Critical) > Add option to use TCP keepalives for client connections > --- > > Key: IMPALA-13253 > URL: https://issues.apache.org/jira/browse/IMPALA-13253 > Project: IMPALA > Issue Type: Task > Components: Backend, Clients >Affects Versions: Impala 4.5.0 >Reporter: Joe McDonnell >Priority: Blocker > > A client can be disconnected without explicitly closing its TCP connection. > This can happen if the client machine resets or there is a network > disruption. In particular, load balancers can have an idle time that results > in a connection becoming invalid. Impala can't really guarantee that the > client will properly tear down its connection and the Impala side resources > will be released. > TCP keepalive would allow Impala to detect dead clients and close the > connection. It also can prevent a load balancer from seeing the connection as > idle. This can be important for clients that hold connections in a pool.
[jira] [Commented] (IMPALA-13202) Impala workloads can exceed Kudu client's rpc_max_message_size limit
[ https://issues.apache.org/jira/browse/IMPALA-13202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868423#comment-17868423 ] Joe McDonnell commented on IMPALA-13202: I filed https://issues.apache.org/jira/browse/KUDU-3595 for the Kudu-side change. This Jira will track the Impala side change to pick up a new Kudu and add a startup parameter to set it. > Impala workloads can exceed Kudu client's rpc_max_message_size limit > > > Key: IMPALA-13202 > URL: https://issues.apache.org/jira/browse/IMPALA-13202 > Project: IMPALA > Issue Type: Bug > Components: Backend >Reporter: Quanlong Huang >Priority: Critical > Attachments: data.parquet > > > The way Impala integrates with KRPC is by porting the KRPC code into the Impala > code base. Flags and methods of KRPC are defined as GLOBAL in the impalad > executable. libkudu_client.so also compiles from the same KRPC code and has > duplicate flags and methods defined as HIDDEN. > To be specific, both the impalad executable and libkudu_client.so have the > symbol for kudu::rpc::InboundTransfer::ReceiveBuffer() > {noformat} > $ readelf -s --wide be/build/latest/service/impalad | grep ReceiveBuffer > 8: 022f5c88 1936 FUNC GLOBAL DEFAULT 13 > _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE > 81380: 022f5c88 1936 FUNC GLOBAL DEFAULT 13 > _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE > $ readelf -s --wide > toolchain/toolchain-packages-gcc10.4.0/kudu-e742f86f6d/debug/lib/libkudu_client.so > | grep ReceiveBuffer > 1601: 00086e4a 108 FUNC LOCAL DEFAULT 12 > _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE.cold > 11905: 001fec60 2076 FUNC LOCAL HIDDEN 12 > _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE > $ c++filt > _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE > kudu::rpc::InboundTransfer::ReceiveBuffer(kudu::Socket*, kudu::faststring*) > {noformat} > KRPC flags like rpc_max_message_size are also defined in both the impalad > executable and libkudu_client.so: > {noformat} > $ readelf -s --wide be/build/latest/service/impalad | grep > FLAGS_rpc_max_message_size > 14380: 06006738 8 OBJECT GLOBAL DEFAULT 30 > _ZN5fLI6426FLAGS_rpc_max_message_sizeE > 80396: 06006741 1 OBJECT GLOBAL DEFAULT 30 > _ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE > 81399: 06006741 1 OBJECT GLOBAL DEFAULT 30 > _ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE > 117873: 06006738 8 OBJECT GLOBAL DEFAULT 30 > _ZN5fLI6426FLAGS_rpc_max_message_sizeE > $ readelf -s --wide > toolchain/toolchain-packages-gcc10.4.0/kudu-e742f86f6d/debug/lib/libkudu_client.so > | grep FLAGS_rpc_max_message_size > 11882: 008d61e1 1 OBJECT LOCAL HIDDEN 27 > _ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE > 11906: 008d61d8 8 OBJECT LOCAL DEFAULT 27 > _ZN5fLI6426FLAGS_rpc_max_message_sizeE > $ c++filt _ZN5fLI6426FLAGS_rpc_max_message_sizeE > fLI64::FLAGS_rpc_max_message_size {noformat} > libkudu_client.so uses its own methods and flags. The flags are HIDDEN so > can't be modified by Impala code. E.g. IMPALA-4874 bumps > FLAGS_rpc_max_message_size to 2GB in RpcMgr::Init(), but the HIDDEN variable > FLAGS_rpc_max_message_size used in libkudu_client.so is still the default > value 50MB (52428800). 
We've seen error messages like this in the master > branch: > {code:java} > I0708 10:23:31.784974 2943 meta_cache.cc:294] > c243bda4702a5ab9:0ba93d240001] tablet 0c8f3446538449ee9d3df5056afe775e: > replica e0e1db54dab74f208e37ea1b975595e5 (127.0.0.1:31202) has failed: > Network error: TS failed: RPC frame had a length of 53477464, but we only > support messages up to 52428800 bytes long.{code} > CC [~joemcdonnell] [~wzhou] [~aserbin]
[jira] [Commented] (IMPALA-13183) Add default timeout for hs2/beeswax server sockets
[ https://issues.apache.org/jira/browse/IMPALA-13183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868420#comment-17868420 ] Joe McDonnell commented on IMPALA-13183: Here is an AWS blog post about how the AWS LB works: [https://aws.amazon.com/blogs/networking-and-content-delivery/introducing-configurable-idle-timeout-for-connection-tracking/] The section about "Scenario #1: TCP connections through AWS Services" explains that it doesn't send packets when a connection goes idle. An endpoint would only find out when it sends a message. I think this is a problem for Impala, and having an idle connection timeout would be one way to avoid issues. > Add default timeout for hs2/beeswax server sockets > -- > > Key: IMPALA-13183 > URL: https://issues.apache.org/jira/browse/IMPALA-13183 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Reporter: Csaba Ringhofer >Priority: Major > > Currently Impala only sets timeouts for specific operations, for example > during the SASL handshake and when checking if a connection can be closed due to an > idle session. > https://github.com/apache/impala/blob/d39596f6fb7da54c24d02523c4691e6b1973857b/be/src/rpc/TAcceptQueueServer.cpp#L153 > https://github.com/apache/impala/blob/d39596f6fb7da54c24d02523c4691e6b1973857b/be/src/transport/TSaslServerTransport.cpp#L145 > There are several cases where an inactive client could keep the connection > open indefinitely, for example if it hasn't opened a session yet. > I think that there should be a longer, general timeout set for both send/recv, > e.g. flag client_default_timeout_s=3600.
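For illustration, a small Python sketch of the general timeout idea from the quoted issue. The flag name client_default_timeout_s is a proposal from the issue, not an existing option, and the code only shows the pattern, not Impala's actual Thrift server.
{code:python}
import socket

def serve_client(conn: socket.socket) -> None:
    # Apply a coarse blocking-I/O timeout so an idle or vanished peer cannot
    # pin a service thread forever. 3600 seconds mirrors the suggested
    # client_default_timeout_s=3600 default.
    conn.settimeout(3600)
    try:
        while True:
            data = conn.recv(4096)
            if not data:
                break  # peer closed the connection cleanly
            # ... handle the request bytes ...
    except socket.timeout:
        pass  # no traffic within the limit; treat the connection as dead
    finally:
        conn.close()  # release the thread and per-connection resources
{code}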
[jira] [Updated] (IMPALA-13202) Impala workloads can exceed Kudu client's rpc_max_message_size limit
[ https://issues.apache.org/jira/browse/IMPALA-13202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell updated IMPALA-13202: --- Summary: Impala workloads can exceed Kudu client's rpc_max_message_size limit (was: KRPC flags used by libkudu_client.so can't be configured) > Impala workloads can exceed Kudu client's rpc_max_message_size limit > > > Key: IMPALA-13202 > URL: https://issues.apache.org/jira/browse/IMPALA-13202 > Project: IMPALA > Issue Type: Bug > Components: Backend >Reporter: Quanlong Huang >Priority: Critical > Attachments: data.parquet > > > The way Impala integrates with KRPC is by porting the KRPC code into the Impala > code base. Flags and methods of KRPC are defined as GLOBAL in the impalad > executable. libkudu_client.so also compiles from the same KRPC code and has > duplicate flags and methods defined as HIDDEN. > To be specific, both the impalad executable and libkudu_client.so have the > symbol for kudu::rpc::InboundTransfer::ReceiveBuffer() > {noformat} > $ readelf -s --wide be/build/latest/service/impalad | grep ReceiveBuffer > 8: 022f5c88 1936 FUNC GLOBAL DEFAULT 13 > _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE > 81380: 022f5c88 1936 FUNC GLOBAL DEFAULT 13 > _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE > $ readelf -s --wide > toolchain/toolchain-packages-gcc10.4.0/kudu-e742f86f6d/debug/lib/libkudu_client.so > | grep ReceiveBuffer > 1601: 00086e4a 108 FUNC LOCAL DEFAULT 12 > _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE.cold > 11905: 001fec60 2076 FUNC LOCAL HIDDEN 12 > _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE > $ c++filt > _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE > kudu::rpc::InboundTransfer::ReceiveBuffer(kudu::Socket*, kudu::faststring*) > {noformat} > KRPC flags like rpc_max_message_size are also defined in both the impalad > executable and libkudu_client.so: > {noformat} > $ readelf -s --wide be/build/latest/service/impalad | grep > FLAGS_rpc_max_message_size > 14380: 06006738 8 OBJECT GLOBAL DEFAULT 30 > _ZN5fLI6426FLAGS_rpc_max_message_sizeE > 80396: 06006741 1 OBJECT GLOBAL DEFAULT 30 > _ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE > 81399: 06006741 1 OBJECT GLOBAL DEFAULT 30 > _ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE > 117873: 06006738 8 OBJECT GLOBAL DEFAULT 30 > _ZN5fLI6426FLAGS_rpc_max_message_sizeE > $ readelf -s --wide > toolchain/toolchain-packages-gcc10.4.0/kudu-e742f86f6d/debug/lib/libkudu_client.so > | grep FLAGS_rpc_max_message_size > 11882: 008d61e1 1 OBJECT LOCAL HIDDEN 27 > _ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE > 11906: 008d61d8 8 OBJECT LOCAL DEFAULT 27 > _ZN5fLI6426FLAGS_rpc_max_message_sizeE > $ c++filt _ZN5fLI6426FLAGS_rpc_max_message_sizeE > fLI64::FLAGS_rpc_max_message_size {noformat} > libkudu_client.so uses its own methods and flags. The flags are HIDDEN so > can't be modified by Impala code. E.g. IMPALA-4874 bumps > FLAGS_rpc_max_message_size to 2GB in RpcMgr::Init(), but the HIDDEN variable > FLAGS_rpc_max_message_size used in libkudu_client.so is still the default > value 50MB (52428800). 
We've seen error messages like this in the master > branch: > {code:java} > I0708 10:23:31.784974 2943 meta_cache.cc:294] > c243bda4702a5ab9:0ba93d240001] tablet 0c8f3446538449ee9d3df5056afe775e: > replica e0e1db54dab74f208e37ea1b975595e5 (127.0.0.1:31202) has failed: > Network error: TS failed: RPC frame had a length of 53477464, but we only > support messages up to 52428800 bytes long.{code} > CC [~joemcdonnell] [~wzhou] [~aserbin]
[jira] [Commented] (IMPALA-13183) Add default timeout for hs2/beeswax server sockets
[ https://issues.apache.org/jira/browse/IMPALA-13183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868178#comment-17868178 ] Joe McDonnell commented on IMPALA-13183: I was just about to file a Jira about having functionality to close idle connections. This sounds similar, so I'm commenting here. We can split it off if it is not quite the same. Basically, there is no current mechanism to close idle connections that have no session. There are circumstances where Hue and other clients that use a connection pool can create such connections. For example, Hue might want to close a query that was executed by a different connection. It opens a connection using the existing session, then when it tries to close the query/session, it finds out that the query/session was already closed. This connection ends up with no associated session and can stay that way for an indefinite period of time. We have seen cases where these connections can stay open on the server side even after the client tries to close it. That seems to be happening with certain load balancers, and it can cause the server to run out of fe service threads. > Add default timeout for hs2/beeswax server sockets > -- > > Key: IMPALA-13183 > URL: https://issues.apache.org/jira/browse/IMPALA-13183 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Reporter: Csaba Ringhofer >Priority: Major > > Currently Impala only sets timeouts for specific operations, for example > during the SASL handshake and when checking if a connection can be closed due to an > idle session. > https://github.com/apache/impala/blob/d39596f6fb7da54c24d02523c4691e6b1973857b/be/src/rpc/TAcceptQueueServer.cpp#L153 > https://github.com/apache/impala/blob/d39596f6fb7da54c24d02523c4691e6b1973857b/be/src/transport/TSaslServerTransport.cpp#L145 > There are several cases where an inactive client could keep the connection > open indefinitely, for example if it hasn't opened a session yet. > I think that there should be a longer, general timeout set for both send/recv, > e.g. flag client_default_timeout_s=3600.
[jira] [Created] (IMPALA-13253) Add option to use TCP keepalives for client connections
Joe McDonnell created IMPALA-13253: -- Summary: Add option to use TCP keepalives for client connections Key: IMPALA-13253 URL: https://issues.apache.org/jira/browse/IMPALA-13253 Project: IMPALA Issue Type: Task Components: Backend, Clients Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell A client can be disconnected without explicitly closing its TCP connection. This can happen if the client machine resets or there is a network disruption. In particular, load balancers can have an idle time that results in a connection becoming invalid. Impala can't really guarantee that the client will properly tear down its connection and the Impala side resources will be released. TCP keepalive would allow Impala to detect dead clients and close the connection. It also can prevent a load balancer from seeing the connection as idle. This can be important for clients that hold connections in a pool.
[jira] [Commented] (IMPALA-13230) Add a way to dump stack traces for impala-shell while it is running
[ https://issues.apache.org/jira/browse/IMPALA-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17866577#comment-17866577 ] Joe McDonnell commented on IMPALA-13230: Example stack trace while running a query:
{noformat}
  File "shell/build/python3_venv/bin/impala-shell", line 8, in <module>
    sys.exit(impala_shell_main())
  File "/home/joemcdonnell/upstream/Impala/shell/build/python3_venv/lib/python3.8/site-packages/impala_shell/impala_shell.py", line 2305, in impala_shell_main
    shell.cmdloop(intro)
  File "/usr/lib/python3.8/cmd.py", line 138, in cmdloop
    stop = self.onecmd(line)
  File "/home/joemcdonnell/upstream/Impala/shell/build/python3_venv/lib/python3.8/site-packages/impala_shell/impala_shell.py", line 788, in onecmd
    return func(arg)
  File "/home/joemcdonnell/upstream/Impala/shell/build/python3_venv/lib/python3.8/site-packages/impala_shell/impala_shell.py", line 1239, in do_select
    return self._execute_stmt(query_str, print_web_link=True)
  File "/home/joemcdonnell/upstream/Impala/shell/build/python3_venv/lib/python3.8/site-packages/impala_shell/impala_shell.py", line 1426, in _execute_stmt
    for rows in rows_fetched:
  File "/home/joemcdonnell/upstream/Impala/shell/build/python3_venv/lib/python3.8/site-packages/impala_shell/impala_client.py", line 926, in fetch
    resp = self._do_hs2_rpc(FetchResults, req)
  File "/home/joemcdonnell/upstream/Impala/shell/build/python3_venv/lib/python3.8/site-packages/impala_shell/impala_client.py", line 1148, in _do_hs2_rpc
    rpc_output = rpc(rpc_input)
  File "/home/joemcdonnell/upstream/Impala/shell/build/python3_venv/lib/python3.8/site-packages/impala_shell/impala_client.py", line 920, in FetchResults
    return self.imp_service.FetchResults(req)
  File "/home/joemcdonnell/upstream/Impala/shell/build/python3_venv/lib/python3.8/site-packages/impala_shell/TCLIService/TCLIService.py", line 756, in FetchResults
    return self.recv_FetchResults()
  File "/home/joemcdonnell/upstream/Impala/shell/build/python3_venv/lib/python3.8/site-packages/impala_shell/TCLIService/TCLIService.py", line 768, in recv_FetchResults
    (fname, mtype, rseqid) = iprot.readMessageBegin()
  File "/home/joemcdonnell/upstream/Impala/shell/build/python3_venv/lib/python3.8/site-packages/thrift/protocol/TBinaryProtocol.py", line 134, in readMessageBegin
    sz = self.readI32()
  File "/home/joemcdonnell/upstream/Impala/shell/build/python3_venv/lib/python3.8/site-packages/thrift/protocol/TBinaryProtocol.py", line 217, in readI32
    buff = self.trans.readAll(4)
  File "/home/joemcdonnell/upstream/Impala/shell/build/python3_venv/lib/python3.8/site-packages/thrift/transport/TTransport.py", line 62, in readAll
    chunk = self.read(sz - have)
  File "/home/joemcdonnell/upstream/Impala/shell/build/python3_venv/lib/python3.8/site-packages/thrift/transport/TTransport.py", line 164, in read
    self.__rbuf = BufferIO(self.__trans.read(max(sz, self.__rbuf_size)))
  File "/home/joemcdonnell/upstream/Impala/shell/build/python3_venv/lib/python3.8/site-packages/thrift/transport/TSocket.py", line 150, in read
    buff = self.handle.recv(sz)
{noformat}
> Add a way to dump stack traces for impala-shell while it is running > --- > > Key: IMPALA-13230 > URL: https://issues.apache.org/jira/browse/IMPALA-13230 > Project: IMPALA > Issue Type: Task > Components: Clients >Affects Versions: Impala 4.5.0 >Reporter: Joe McDonnell >Priority: Major > > It can be useful to get the Python stack traces for impala-shell when it is > stuck. 
There is a nice thread on Stack Overflow about how to do this: > [https://stackoverflow.com/questions/132058/showing-the-stack-trace-from-a-running-python-application] > One option is to install a signal handler for the SIGUSR1 signal and use that > to dump a backtrace. I tried this and it works for Python 3 (but causes > issues for running queries on Python 2): > {noformat} > # For debugging, it is useful to handle the SIGUSR1 signal and use it to > print a > # stacktrace > signal.signal(signal.SIGUSR1, lambda sid, stack: > traceback.print_stack(stack)){noformat} > Another option mentioned is the faulthandler module > ([https://docs.python.org/dev/library/faulthandler.html]), which provides a way to do the same thing. The faulthandler module seems > to be able to do this for all threads, not just the main thread. > Either way, this would give us some options if we need to debug impala-shell > out in the wild.
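For reference, the faulthandler variant would be a one-liner; this sketch assumes Python 3 on a POSIX system (faulthandler.register() does not exist on Windows).
{code:python}
import faulthandler
import signal

# Dump the stacks of all threads to stderr whenever the process receives
# SIGUSR1, e.g. via `kill -USR1 <impala-shell pid>`. Unlike the
# signal.signal() approach above, this covers every thread, not just the
# main thread.
faulthandler.register(signal.SIGUSR1, all_threads=True, chain=False)
{code}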
[jira] [Created] (IMPALA-13230) Add a way to dump stack traces for impala-shell while it is running
Joe McDonnell created IMPALA-13230: -- Summary: Add a way to dump stack traces for impala-shell while it is running Key: IMPALA-13230 URL: https://issues.apache.org/jira/browse/IMPALA-13230 Project: IMPALA Issue Type: Task Components: Clients Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell It can be useful to get the Python stack traces for impala-shell when it is stuck. There is a nice thread on Stack Overflow about how to do this: [https://stackoverflow.com/questions/132058/showing-the-stack-trace-from-a-running-python-application] One option is to install a signal handler for the SIGUSR1 signal and use that to dump a backtrace. I tried this and it works for Python 3 (but causes issues for running queries on Python 2):
{noformat}
# For debugging, it is useful to handle the SIGUSR1 signal and use it to print a
# stacktrace
signal.signal(signal.SIGUSR1, lambda sid, stack: traceback.print_stack(stack))
{noformat}
Another option mentioned is the faulthandler module ([https://docs.python.org/dev/library/faulthandler.html]), which provides a way to do the same thing. The faulthandler module seems to be able to do this for all threads, not just the main thread. Either way, this would give us some options if we need to debug impala-shell out in the wild.
[jira] [Created] (IMPALA-13229) Improve logging for TAcceptQueueServer when a thread takes a long time in SASL negotiation
Joe McDonnell created IMPALA-13229: -- Summary: Improve logging for TAcceptQueueServer when a thread takes a long time in SASL negotiation Key: IMPALA-13229 URL: https://issues.apache.org/jira/browse/IMPALA-13229 Project: IMPALA Issue Type: Task Components: Backend Affects Versions: Impala 4.4.0 Reporter: Joe McDonnell In IMPALA-11653, we are concerned about bad clients that use up threads in the SASL negotiation thread pool for long periods of time (or eventually hit sasl_connect_tcp_timeout_ms). As a separate task, it would be useful to be able to quickly tell from the logs whether a connection spends a lot of time in the SASL negotiation and could be creating this type of problem. We should add some logging to make this issue clear from the logs. One option is to log a warning if SASL negotiation takes longer than some threshold (and thus was using up a thread during that time). If SASL negotiation is taking longer than a few seconds, that can be a real issue.
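A sketch of the proposed logging, in Python for brevity (the real change would be in the C++ TAcceptQueueServer); the threshold and the do_sasl_negotiation() helper are placeholders, not existing Impala code.
{code:python}
import logging
import time

NEGOTIATION_WARN_THRESHOLD_S = 5.0  # illustrative threshold


def do_sasl_negotiation(conn):
    """Placeholder for the actual SASL handshake with the client."""


def negotiate_with_logging(conn):
    # Time the handshake with a monotonic clock and warn when it exceeds the
    # threshold, so connections that tie up a negotiation-pool thread stand
    # out in the logs.
    start = time.monotonic()
    do_sasl_negotiation(conn)
    elapsed = time.monotonic() - start
    if elapsed > NEGOTIATION_WARN_THRESHOLD_S:
        logging.warning("SASL negotiation took %.1fs for %r", elapsed, conn)
{code}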
[jira] [Commented] (IMPALA-13202) KRPC flags used by libkudu_client.so can't be configured
[ https://issues.apache.org/jira/browse/IMPALA-13202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17866157#comment-17866157 ] Joe McDonnell commented on IMPALA-13202: It seems like one path would be for the Kudu client to add this as a configuration in KuduClientBuilder, and then Impala could specify the value there. That is how we usually pass in configuration parameters for the Kudu client. See [https://github.com/apache/impala/blob/master/be/src/exec/kudu/kudu-util.cc#L85-L104]. I think it is good to have these things as part of the client API rather than setting global variables. I think it is good that the Kudu client's flags are hidden and can't be set. My understanding is that Impala's rpc_max_message_size parameter was intended to apply to Impala-to-Impala communication, not Impala-to-Kudu communication. > KRPC flags used by libkudu_client.so can't be configured > > > Key: IMPALA-13202 > URL: https://issues.apache.org/jira/browse/IMPALA-13202 > Project: IMPALA > Issue Type: Bug > Components: Backend >Reporter: Quanlong Huang >Priority: Critical > Attachments: data.parquet > > > The way Impala integrates with KRPC is by porting the KRPC code into the Impala > code base. Flags and methods of KRPC are defined as GLOBAL in the impalad > executable. libkudu_client.so also compiles from the same KRPC code and has > duplicate flags and methods defined as HIDDEN. > To be specific, both the impalad executable and libkudu_client.so have the > symbol for kudu::rpc::InboundTransfer::ReceiveBuffer() > {noformat} > $ readelf -s --wide be/build/latest/service/impalad | grep ReceiveBuffer > 8: 022f5c88 1936 FUNC GLOBAL DEFAULT 13 > _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE > 81380: 022f5c88 1936 FUNC GLOBAL DEFAULT 13 > _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE > $ readelf -s --wide > toolchain/toolchain-packages-gcc10.4.0/kudu-e742f86f6d/debug/lib/libkudu_client.so > | grep ReceiveBuffer > 1601: 00086e4a 108 FUNC LOCAL DEFAULT 12 > _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE.cold > 11905: 001fec60 2076 FUNC LOCAL HIDDEN 12 > _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE > $ c++filt > _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE > kudu::rpc::InboundTransfer::ReceiveBuffer(kudu::Socket*, kudu::faststring*) > {noformat} > KRPC flags like rpc_max_message_size are also defined in both the impalad > executable and libkudu_client.so: > {noformat} > $ readelf -s --wide be/build/latest/service/impalad | grep > FLAGS_rpc_max_message_size > 14380: 06006738 8 OBJECT GLOBAL DEFAULT 30 > _ZN5fLI6426FLAGS_rpc_max_message_sizeE > 80396: 06006741 1 OBJECT GLOBAL DEFAULT 30 > _ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE > 81399: 06006741 1 OBJECT GLOBAL DEFAULT 30 > _ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE > 117873: 06006738 8 OBJECT GLOBAL DEFAULT 30 > _ZN5fLI6426FLAGS_rpc_max_message_sizeE > $ readelf -s --wide > toolchain/toolchain-packages-gcc10.4.0/kudu-e742f86f6d/debug/lib/libkudu_client.so > | grep FLAGS_rpc_max_message_size > 11882: 008d61e1 1 OBJECT LOCAL HIDDEN 27 > _ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE > 11906: 008d61d8 8 OBJECT LOCAL DEFAULT 27 > _ZN5fLI6426FLAGS_rpc_max_message_sizeE > $ c++filt _ZN5fLI6426FLAGS_rpc_max_message_sizeE > fLI64::FLAGS_rpc_max_message_size {noformat} > libkudu_client.so uses its own methods and flags. The flags are HIDDEN so > can't be modified by Impala code. E.g. 
IMPALA-4874 bumps > FLAGS_rpc_max_message_size to 2GB in RpcMgr::Init(), but the HIDDEN variable > FLAGS_rpc_max_message_size used in libkudu_client.so is still the default > value 50MB (52428800). We've seen error messages like this in the master > branch: > {code:java} > I0708 10:23:31.784974 2943 meta_cache.cc:294] > c243bda4702a5ab9:0ba93d240001] tablet 0c8f3446538449ee9d3df5056afe775e: > replica e0e1db54dab74f208e37ea1b975595e5 (127.0.0.1:31202) has failed: > Network error: TS failed: RPC frame had a length of 53477464, but we only > support messages up to 52428800 bytes long.{code} > CC [~joemcdonnell] [~wzhou] [~aserbin]
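As a purely hypothetical sketch of what a client-API knob could look like, using the kudu-python client's connect() entry point as a stand-in (the Impala call site is the C++ KuduClientBuilder linked above): the max_rpc_message_size keyword does not exist today and stands in for whatever KUDU-3595 ends up adding.
{code:python}
import kudu

# Hypothetical: pass the message-size limit through the client API instead
# of a hidden global flag. Only host/port are real kudu.connect() parameters
# here; max_rpc_message_size is invented for illustration.
client = kudu.connect(host='kudu-master.example.com', port=7051,
                      max_rpc_message_size=2 * 1024 ** 3)  # hypothetical knob
{code}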
[jira] [Work started] (IMPALA-12906) Incorporate run time scan range information into the tuple cache key
[ https://issues.apache.org/jira/browse/IMPALA-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-12906 started by Joe McDonnell. -- > Incorporate run time scan range information into the tuple cache key > > > Key: IMPALA-12906 > URL: https://issues.apache.org/jira/browse/IMPALA-12906 > Project: IMPALA > Issue Type: Task > Components: Backend, Frontend >Affects Versions: Impala 4.4.0 >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Major > > The cache key for tuple caching currently doesn't incorporate information > about the scan ranges for the tables that it scans. This is important for > detecting changes in the table and having different cache keys for different > fragment instances that are assigned different scan ranges. > To make this deterministic for mt_dop, we need mt_dop to assign scan ranges > deterministically to individual fragment instances rather than using the > shared queue introduced in IMPALA-9655. > One way to implement this is to collect information about the scan nodes that > feed into the tuple cache and pass that information over to the tuple cache > node. At runtime, it can hash the scan ranges assigned to those scan nodes > and incorporate that into the cache key.
[jira] [Assigned] (IMPALA-12906) Incorporate run time scan range information into the tuple cache key
[ https://issues.apache.org/jira/browse/IMPALA-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell reassigned IMPALA-12906: -- Assignee: Joe McDonnell > Incorporate run time scan range information into the tuple cache key > > > Key: IMPALA-12906 > URL: https://issues.apache.org/jira/browse/IMPALA-12906 > Project: IMPALA > Issue Type: Task > Components: Backend, Frontend >Affects Versions: Impala 4.4.0 >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Major > > The cache key for tuple caching currently doesn't incorporate information > about the scan ranges for the tables that it scans. This is important for > detecting changes in the table and having different cache keys for different > fragment instances that are assigned different scan ranges. > To make this deterministic for mt_dop, we need mt_dop to assign scan ranges > deterministically to individual fragment instances rather than using the > shared queue introduced in IMPALA-9655. > One way to implement this is to collect information about the scan nodes that > feed into the tuple cache and pass that information over to the tuple cache > node. At runtime, it can hash the scan ranges assigned to those scan nodes > and incorporate that into the cache key.
[jira] [Assigned] (IMPALA-12817) Introduce basic intermediate result caching to speed similar queries
[ https://issues.apache.org/jira/browse/IMPALA-12817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell reassigned IMPALA-12817: -- Assignee: Joe McDonnell > Introduce basic intermediate result caching to speed similar queries > > > Key: IMPALA-12817 > URL: https://issues.apache.org/jira/browse/IMPALA-12817 > Project: IMPALA > Issue Type: Epic >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Major > > This tracks the first phase of intermediate result caching. > The goals of the initial phase are to introduce a basic framework for caching > tuples at various points in the plan. The first location that needs to work > is immediately above an HdfsScanNode. Caching will use a local SSD to store > the cache.
[jira] [Created] (IMPALA-13188) Add test that compute stats does not result in a different tuple cache key
Joe McDonnell created IMPALA-13188: -- Summary: Add test that compute stats does not result in a different tuple cache key Key: IMPALA-13188 URL: https://issues.apache.org/jira/browse/IMPALA-13188 Project: IMPALA Issue Type: Task Components: Frontend Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell If someone runs "compute stats" on the underlying tables for a query, the tuple cache key should only change if the plan actually changes. The resource estimates should not be incorporated into the tuple cache key as they have no semantic impact. The code already excludes the resource estimates from the key for the PlanNode, but we should have tests for computing stats and verifying that the key doesn't change.
[jira] [Created] (IMPALA-13186) Tuple cache keys should incorporate information about related query options
Joe McDonnell created IMPALA-13186: -- Summary: Tuple cache keys should incorporate information about related query options Key: IMPALA-13186 URL: https://issues.apache.org/jira/browse/IMPALA-13186 Project: IMPALA Issue Type: Bug Components: Frontend Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell Currently, the tuple cache key does not include information from the query options. Many query options have no impact on the result of a query (e.g. idle_session_timeout) or are evaluated purely on the coordinator during planning (e.g. broadcast_bytes_limit). However, some query options can impact behavior either by controlling how certain things are calculated (e.g. decimal_v2) or controlling what conditions result in an error. Changing a query option can change the output of a query. We need some way to incorporate the relevant query options into the tuple cache key so there is no correctness issue.
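One plausible shape for this, sketched in Python with an invented allow-list: hash only the options known to affect results and fold that into the cache key. Of the entries, only decimal_v2 comes from the issue text; the structure is an assumption, not Impala's actual implementation.
{code:python}
import hashlib

# Assumed allow-list: decimal_v2 is named in the issue; any further entries
# would come from an audit of which options can change query results.
RESULT_AFFECTING_OPTIONS = frozenset({"decimal_v2"})


def options_key_component(query_options: dict) -> bytes:
    # Sort for determinism so the same settings always hash identically,
    # regardless of the order the options were set in.
    relevant = sorted((k, str(v)) for k, v in query_options.items()
                      if k in RESULT_AFFECTING_OPTIONS)
    return hashlib.sha256(repr(relevant).encode()).digest()
{code}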
[jira] [Created] (IMPALA-13185) Tuple cache keys need to incorporate runtime filter information
Joe McDonnell created IMPALA-13185: -- Summary: Tuple cache keys need to incorporate runtime filter information Key: IMPALA-13185 URL: https://issues.apache.org/jira/browse/IMPALA-13185 Project: IMPALA Issue Type: Bug Components: Frontend Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell If a runtime filter impacts the results of a fragment, then the tuple cache key needs to incorporate information about the generation of that runtime filter. This needs to include information about the base tables that impact the runtime filter. For example, suppose there is a join. The build side of the join produces a runtime filter that gets delivered to the probe side of the join. The tuple cache key for the probe side of the join will need to include a representation of the runtime filter. If the table on the build side of the join changes, the tuple cache key for the probe side needs to change due to the possible difference in the runtime filter. This can also impact eligibility. In theory, the build side of a join could be constructed from a source with a limit specified, and this can result in non-determinism. Since the build of the runtime filter is not deterministic, the consumer of the runtime filter is not deterministic and can't participate in tuple caching.
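A sketch of the key construction this implies, with invented filter fields: fold each filter's identity and contents into the compile-time key so fragment instances that saw different filters can never share a cache entry.
{code:python}
import hashlib

def key_with_runtime_filters(compile_time_key: bytes, runtime_filters) -> str:
    # runtime_filters is assumed to be an iterable of objects carrying a
    # filter_id and a serialized payload; both field names are invented
    # for illustration.
    h = hashlib.sha256(compile_time_key)
    for f in sorted(runtime_filters, key=lambda f: f.filter_id):
        h.update(str(f.filter_id).encode())
        h.update(f.payload)  # the filter contents, e.g. bloom filter bytes
    return h.hexdigest()
{code}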
[jira] [Created] (IMPALA-13181) Disable tuple caching for locations that have a limit
Joe McDonnell created IMPALA-13181: -- Summary: Disable tuple caching for locations that have a limit Key: IMPALA-13181 URL: https://issues.apache.org/jira/browse/IMPALA-13181 Project: IMPALA Issue Type: Bug Components: Frontend Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell Statements that use a limit are non-deterministic unless there is a sort. Locations with limits should be marked ineligible for tuple caching. As an example, for a hash join, suppose the build side has a limit. This means that the build side could vary from run to run. A requirement for our correctness is that all nodes agree on the contents of the build side. The variability of the limit is a problem for the build side, because if one node hits the cache and another does not, there is no guarantee that they agree on the contents of the build side. Concrete example: {noformat} select a.l_orderkey from (select l_orderkey from tpch_parquet.lineitem limit 10) a, tpch_parquet.orders b where a.l_orderkey = b.o_orderkey;{noformat} There are times when limits are deterministic or the non-determinism is harmless. It is safer to ban it completely at first. In a future change, this rule can be relaxed to allow caching in those cases.
[jira] [Created] (IMPALA-13179) Disable tuple caching when using non-deterministic functions
Joe McDonnell created IMPALA-13179: -- Summary: Disable tuple caching when using non-deterministic functions Key: IMPALA-13179 URL: https://issues.apache.org/jira/browse/IMPALA-13179 Project: IMPALA Issue Type: Bug Components: Frontend Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell Some functions are non-deterministic, so tuple caching needs to detect those functions and avoid caching at locations that are non-deterministic. There are two different pieces: # Correctness: If the key is constant but the results can be variable, then that is a correctness issue. That can happen for genuinely random functions like uuid(). It can happen when timestamp functions like now() are evaluated at runtime. # Performance: The frontend does constant-folding of functions that don't vary during execution, so something like now() might be replaced by a hard-coded integer. This means that the key contains something that varies frequently. That can be a performance issue, because we can be caching things that cannot be reused. This doesn't have the same correctness issue. This ticket is focused on the correctness piece. If uuid()/now()/etc. are referenced and would be evaluated at runtime, the location should be ineligible for tuple caching.
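The correctness check amounts to a walk over the runtime-evaluated expressions looking for non-deterministic builtins; a sketch follows, where the function list and the expression shape are both assumptions for illustration.
{code:python}
# Illustrative, not Impala's real builtin list: uuid() and now() come from
# the issue text; rand() is another obvious member.
NONDETERMINISTIC_FNS = frozenset({"uuid", "now", "rand"})


def is_cache_eligible(exprs) -> bool:
    # exprs is assumed to be an iterable of (function_name, children) trees
    # that will be evaluated at runtime (i.e. survived constant folding).
    def has_nondeterministic(expr):
        name, children = expr
        return (name in NONDETERMINISTIC_FNS
                or any(has_nondeterministic(c) for c in children))
    return not any(has_nondeterministic(e) for e in exprs)
{code}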
[jira] [Resolved] (IMPALA-12541) Compile toolchain GCC with --enable-linker-build-id to add Build ID to binaries
[ https://issues.apache.org/jira/browse/IMPALA-12541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell resolved IMPALA-12541. Fix Version/s: Impala 4.5.0 Resolution: Fixed > Compile toolchain GCC with --enable-linker-build-id to add Build ID to > binaries > --- > > Key: IMPALA-12541 > URL: https://issues.apache.org/jira/browse/IMPALA-12541 > Project: IMPALA > Issue Type: Improvement > Components: Infrastructure >Affects Versions: Impala 4.4.0 >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Major > Fix For: Impala 4.5.0 > > > A "Build ID" is a unique identifier for binaries (which is a hash of the > contents). Producing OS packages with separate debug symbols requires each > binary to have a Build ID. This is particularly important for libstdc++, > because it is produced during the native-toolchain build rather than the > regular Impala build. To turn on Build IDs, one can configure that at GCC > build time by specifying "--enable-linker-build-id". This causes GCC to tell > the linker to compute the Build ID. > Breakpad will also use the Build ID when resolving symbols.
[jira] [Commented] (IMPALA-12541) Compile toolchain GCC with --enable-linker-build-id to add Build ID to binaries
[ https://issues.apache.org/jira/browse/IMPALA-12541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859971#comment-17859971 ] Joe McDonnell commented on IMPALA-12541:
{noformat}
commit e78b0ef34241218cda7eac3b526cb6a824596df1
Author: Joe McDonnell
Date: Fri Nov 3 14:18:47 2023 -0700

IMPALA-12541: Build GCC with --enable-linker-build-id

This builds GCC with --enable-linker-build-id so that binaries have
Build ID specified. Build ID is needed to produce OS packages with
separate debuginfo. This is particularly important for libstdc++,
because it is not built as part of the regular Impala build.

Testing:
 - Verified that resulting binaries have .note.gnu.build-id

Change-Id: Ieb2017ba1a348a9e9e549fa3268635afa94ae6d0
Reviewed-on: http://gerrit.cloudera.org:8080/21469
Reviewed-by: Michael Smith
Reviewed-by: Laszlo Gaal
Tested-by: Joe McDonnell
{noformat}
> Compile toolchain GCC with --enable-linker-build-id to add Build ID to > binaries > --- > > Key: IMPALA-12541 > URL: https://issues.apache.org/jira/browse/IMPALA-12541 > Project: IMPALA > Issue Type: Improvement > Components: Infrastructure >Affects Versions: Impala 4.4.0 >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Major > > A "Build ID" is a unique identifier for binaries (which is a hash of the > contents). Producing OS packages with separate debug symbols requires each > binary to have a Build ID. This is particularly important for libstdc++, > because it is produced during the native-toolchain build rather than the > regular Impala build. To turn on Build IDs, one can configure that at GCC > build time by specifying "--enable-linker-build-id". This causes GCC to tell > the linker to compute the Build ID. > Breakpad will also use the Build ID when resolving symbols.
[jira] [Commented] (IMPALA-13121) Move the toolchain to a newer version of ccache
[ https://issues.apache.org/jira/browse/IMPALA-13121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859962#comment-17859962 ] Joe McDonnell commented on IMPALA-13121:
{noformat}
commit b9167e985c69fd321e9e25e5ae0c7747682f06f6
Author: Joe McDonnell
Date: Fri May 31 15:20:20 2024 -0700

IMPALA-13121: Switch to ccache 3.7.12

The docker images currently build and use ccache 3.3.3. Recently, we
ran into a case where debuginfo was being generated even though the
cflags ended with -g0. The ccache release history has this note for
3.3.5:

 - Fixed a regression where the original order of debug options
   could be lost.

This upgrades ccache to 3.7.12 to address this issue. Ccache 3.7.12
is the last ccache release that builds using autotools. Ccache 4
moves to build with CMake. Adding a CMake dependency would be
complicated at this stage, because some of the older OSes don't
provide a new enough CMake in the package repositories. Since we
don't really need the new features of Ccache 4+, this sticks with
3.7.12 for now.

This reenables the check_ccache_works() logic in
assert-dependencies-present.py.

Testing:
 - Built docker images and ran a toolchain build
 - The newer ccache resolves the unexpected debuginfo issue

Change-Id: I90d751445daa0dc298b634c1049d637a14afac40
Reviewed-on: http://gerrit.cloudera.org:8080/21473
Reviewed-by: Michael Smith
Reviewed-by: Laszlo Gaal
Tested-by: Joe McDonnell
{noformat}
> Move the toolchain to a newer version of ccache > --- > > Key: IMPALA-13121 > URL: https://issues.apache.org/jira/browse/IMPALA-13121 > Project: IMPALA > Issue Type: Task > Components: Infrastructure >Affects Versions: Impala 4.5.0 >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Major > > The native-toolchain currently uses ccache 3.3.3. In a recent change adding > debug info, I ran into a case where the debug level was not what I expected. > I had added a -g0 at the end to turn off debug information for the cmake > build, but it still ended up with debug info. > The release notes for ccache 3.3.5 says this: > * Fixed a regression where the original order of debug options could be > lost. This reverts the “Improved parsing of {{-g*}} options” feature in > ccache 3.3. > [https://ccache.dev/releasenotes.html#_ccache_3_3_5] > I think I may have been hitting that. We should upgrade ccache to a more > recent version.
[jira] [Resolved] (IMPALA-13121) Move the toolchain to a newer version of ccache
[ https://issues.apache.org/jira/browse/IMPALA-13121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell resolved IMPALA-13121. Fix Version/s: Impala 4.5.0 Resolution: Fixed > Move the toolchain to a newer version of ccache > --- > > Key: IMPALA-13121 > URL: https://issues.apache.org/jira/browse/IMPALA-13121 > Project: IMPALA > Issue Type: Task > Components: Infrastructure >Affects Versions: Impala 4.5.0 >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Major > Fix For: Impala 4.5.0 > > > The native-toolchain currently uses ccache 3.3.3. In a recent change adding > debug info, I ran into a case where the debug level was not what I expected. > I had added a -g0 at the end to turn off debug information for the cmake > build, but it still ended up with debug info. > The release notes for ccache 3.3.5 says this: > * Fixed a regression where the original order of debug options could be > lost. This reverts the “Improved parsing of {{-g*}} options” feature in > ccache 3.3. > [https://ccache.dev/releasenotes.html#_ccache_3_3_5] > I think I may have been hitting that. We should upgrade ccache to a more > recent version. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (IMPALA-13146) Javascript tests sometimes fail to download NodeJS
[ https://issues.apache.org/jira/browse/IMPALA-13146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell resolved IMPALA-13146. Fix Version/s: Impala 4.5.0 Resolution: Fixed > Javascript tests sometimes fail to download NodeJS > -- > > Key: IMPALA-13146 > URL: https://issues.apache.org/jira/browse/IMPALA-13146 > Project: IMPALA > Issue Type: Bug > Components: Infrastructure >Affects Versions: Impala 4.5.0 >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Critical > Labels: broken-build, flaky > Fix For: Impala 4.5.0 > > > For automated tests, sometimes the Javascript tests fail to download NodeJS: > {noformat} > 01:37:16 Fetching NodeJS v16.20.2-linux-x64 binaries ... > 01:37:16 % Total% Received % Xferd Average Speed TimeTime > Time Current > 01:37:16 Dload Upload Total Spent > Left Speed > 01:37:16 > 0 00 00 0 0 0 --:--:-- --:--:-- --:--:-- 0 > 0 00 00 0 0 0 --:--:-- 0:00:01 --:--:-- 0 > 0 00 00 0 0 0 --:--:-- 0:00:02 --:--:-- 0 > 0 21.5M0 9020 0293 0 21:23:04 0:00:03 21:23:01 293 > ... > 30 21.5M 30 6776k 0 0 50307 0 0:07:28 0:02:17 0:05:11 23826 > 01:39:34 curl: (18) transfer closed with 15617860 bytes remaining to > read{noformat} > If this keeps happening, we should mirror the NodeJS binary on the > native-toolchain s3 bucket. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-13136) Refactor AnalyzedFunctionCallExpr
[ https://issues.apache.org/jira/browse/IMPALA-13136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854526#comment-17854526 ] Joe McDonnell commented on IMPALA-13136: [~scarlin] I'm ok with punting on this for a while. We have a long list of things that need to land, and this is more about code cleanliness than functionality. > Refactor AnalyzedFunctionCallExpr > - > > Key: IMPALA-13136 > URL: https://issues.apache.org/jira/browse/IMPALA-13136 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Reporter: Steve Carlin >Priority: Major > > Copied from code review: > The part where we immediately analyze as part of the constructor makes for > complicated exception handling. RexVisitor doesn't support exceptions, so it > adds complication to handle them under those circumstances. I can't really > explain why it is necessary. > Let me sketch out an alternative: > 1. Construct the whole Expr tree without analyzing it > 2. Any errors that happen during this process are not usually actionable by > the end user. It's good to have a descriptive error message, but it doesn't > mean there is something wrong with the SQL. I think that it is ok for this > code to throw subclasses of RuntimeException or use > Preconditions.checkState() with a good explanation. > 3. When we get the Expr tree back in CreateExprVisitor::getExpr(), we call > analyze() on the root node, which does a recursive analysis of the whole tree. > 4. The special Expr classes don't run analyze() in the constructor, don't > keep a reference to the Analyzer, and don't override resetAnalysisState(). > They override analyzeImpl() and they should be idempotent. The clone > constructor should not need to do anything special, just do a deep copy. > I don't want to bog down this review. If we want to address this as a > followup, I can live with that, but I don't want us to go too far down this > road. (Or if we have a good explanation for why it is necessary, then we can > write a good comment and move on.) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Assigned] (IMPALA-13151) DataStreamTestSlowServiceQueue.TestPrioritizeEos fails on ARM
[ https://issues.apache.org/jira/browse/IMPALA-13151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell reassigned IMPALA-13151: -- Assignee: Michael Smith > DataStreamTestSlowServiceQueue.TestPrioritizeEos fails on ARM > - > > Key: IMPALA-13151 > URL: https://issues.apache.org/jira/browse/IMPALA-13151 > Project: IMPALA > Issue Type: Bug > Components: Backend >Affects Versions: Impala 4.4.0 >Reporter: Joe McDonnell >Assignee: Michael Smith >Priority: Critical > Labels: broken-build > > The recently introduced DataStreamTestSlowServiceQueue.TestPrioritizeEos is > failing with errors like this: > {noformat} > /data/jenkins/workspace/impala-asf-master-core-asan-arm/repos/Impala/be/src/runtime/data-stream-test.cc:912 > Expected: (timer.ElapsedTime()) > (3 * MonoTime::kNanosecondsPerSecond), > actual: 269834 vs 30{noformat} > So far, I only see failures on ARM jobs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13151) DataStreamTestSlowServiceQueue.TestPrioritizeEos fails on ARM
Joe McDonnell created IMPALA-13151: -- Summary: DataStreamTestSlowServiceQueue.TestPrioritizeEos fails on ARM Key: IMPALA-13151 URL: https://issues.apache.org/jira/browse/IMPALA-13151 Project: IMPALA Issue Type: Bug Components: Backend Affects Versions: Impala 4.4.0 Reporter: Joe McDonnell The recently introduced DataStreamTestSlowServiceQueue.TestPrioritizeEos is failing with errors like this: {noformat} /data/jenkins/workspace/impala-asf-master-core-asan-arm/repos/Impala/be/src/runtime/data-stream-test.cc:912 Expected: (timer.ElapsedTime()) > (3 * MonoTime::kNanosecondsPerSecond), actual: 269834 vs 30{noformat} So far, I only see failures on ARM jobs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
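One simple way to chase an intermittent failure like this is to loop just the flaky gtest case; the binary path below follows Impala's usual be/build layout but is an assumption here:
{noformat}
# Re-run only the flaky case until it fails (binary path is an assumption).
for i in $(seq 1 50); do
  ./be/build/latest/runtime/data-stream-test \
    --gtest_filter='DataStreamTestSlowServiceQueue.TestPrioritizeEos' || break
done
{noformat}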
[jira] [Assigned] (IMPALA-13146) Javascript tests sometimes fail to download NodeJS
[ https://issues.apache.org/jira/browse/IMPALA-13146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell reassigned IMPALA-13146: -- Assignee: Joe McDonnell > Javascript tests sometimes fail to download NodeJS > -- > > Key: IMPALA-13146 > URL: https://issues.apache.org/jira/browse/IMPALA-13146 > Project: IMPALA > Issue Type: Bug > Components: Infrastructure >Affects Versions: Impala 4.5.0 >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Critical > Labels: broken-build, flaky > > For automated tests, sometimes the Javascript tests fail to download NodeJS: > {noformat} > 01:37:16 Fetching NodeJS v16.20.2-linux-x64 binaries ... > 01:37:16 % Total% Received % Xferd Average Speed TimeTime > Time Current > 01:37:16 Dload Upload Total Spent > Left Speed > 01:37:16 > 0 00 00 0 0 0 --:--:-- --:--:-- --:--:-- 0 > 0 00 00 0 0 0 --:--:-- 0:00:01 --:--:-- 0 > 0 00 00 0 0 0 --:--:-- 0:00:02 --:--:-- 0 > 0 21.5M0 9020 0293 0 21:23:04 0:00:03 21:23:01 293 > ... > 30 21.5M 30 6776k 0 0 50307 0 0:07:28 0:02:17 0:05:11 23826 > 01:39:34 curl: (18) transfer closed with 15617860 bytes remaining to > read{noformat} > If this keeps happening, we should mirror the NodeJS binary on the > native-toolchain s3 bucket. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13147) Add support for limiting the concurrency of link jobs
Joe McDonnell created IMPALA-13147: -- Summary: Add support for limiting the concurrency of link jobs Key: IMPALA-13147 URL: https://issues.apache.org/jira/browse/IMPALA-13147 Project: IMPALA Issue Type: Improvement Components: Infrastructure Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell Link jobs can use a lot of memory due to the amount of debug info. The level of concurrency that is useful for compilation can be too high for linking. Running a link-heavy command like buildall.sh -skiptests can run out of memory from linking all of the backend tests / benchmarks. It would be useful to be able to limit the number of concurrent link jobs. There are two basic approaches: When using the ninja generator for CMake, ninja supports having job pools with limited parallelism. CMake has support for mapping link tasks to their own pool. Here is an example: {noformat} set(CMAKE_JOB_POOLS compilation_pool=24 link_pool=8) set(CMAKE_JOB_POOL_COMPILE compilation_pool) set(CMAKE_JOB_POOL_LINK link_pool){noformat} The makefile generator does not have equivalent functionality, but we could do a more limited version where buildall.sh can split the -skiptests into two make invocations. The first does all the compilation with full parallelism (equivalent to -notests) and then the second make invocation does the backend tests / benchmarks with a reduced parallelism. -- This message was sent by Atlassian Jira (v8.20.10#820010)
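For the makefile path, the two-invocation idea could look roughly like this; the target names and job counts are assumptions for illustration, not actual buildall.sh targets:
{noformat}
# Phase 1: compile everything except the backend tests at full parallelism
# (equivalent to -notests).
make -j$(nproc) notests
# Phase 2: build and link the backend tests/benchmarks with fewer jobs so
# that concurrent link steps don't exhaust memory.
make -j4 be-tests be-benchmarks
{noformat}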
[jira] [Created] (IMPALA-13146) Javascript tests sometimes fail to download NodeJS
Joe McDonnell created IMPALA-13146: -- Summary: Javascript tests sometimes fail to download NodeJS Key: IMPALA-13146 URL: https://issues.apache.org/jira/browse/IMPALA-13146 Project: IMPALA Issue Type: Bug Components: Infrastructure Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell For automated tests, sometimes the Javascript tests fail to download NodeJS: {noformat} 01:37:16 Fetching NodeJS v16.20.2-linux-x64 binaries ... 01:37:16 % Total% Received % Xferd Average Speed TimeTime Time Current 01:37:16 Dload Upload Total SpentLeft Speed 01:37:16 0 00 00 0 0 0 --:--:-- --:--:-- --:--:-- 0 0 00 00 0 0 0 --:--:-- 0:00:01 --:--:-- 0 0 00 00 0 0 0 --:--:-- 0:00:02 --:--:-- 0 0 21.5M0 9020 0293 0 21:23:04 0:00:03 21:23:01 293 ... 30 21.5M 30 6776k 0 0 50307 0 0:07:28 0:02:17 0:05:11 23826 01:39:34 curl: (18) transfer closed with 15617860 bytes remaining to read{noformat} If this keeps happening, we should mirror the NodeJS binary on the native-toolchain s3 bucket. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
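Short of mirroring the binary, making the download retry-friendly is one mitigation; a sketch assuming the standard nodejs.org dist layout:
{noformat}
NODE_VERSION=v16.20.2
TARBALL=node-${NODE_VERSION}-linux-x64.tar.gz
# --retry re-attempts transient failures and -C - resumes a partial
# download instead of starting over from byte zero.
curl --fail --location --retry 5 --retry-delay 10 -C - \
  -o "${TARBALL}" "https://nodejs.org/dist/${NODE_VERSION}/${TARBALL}"
{noformat}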
[jira] [Created] (IMPALA-13145) Upgrade mold linker to 2.31.0
Joe McDonnell created IMPALA-13145: -- Summary: Upgrade mold linker to 2.31.0 Key: IMPALA-13145 URL: https://issues.apache.org/jira/browse/IMPALA-13145 Project: IMPALA Issue Type: Improvement Components: Infrastructure Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell Mold 2.31.0 claims performance improvements and a reduction in the memory needed for linking. See [https://github.com/rui314/mold/releases/tag/v2.31.0] and [https://github.com/rui314/mold/commit/53ebcd80d888778cde16952270f73343f090f342] We should move to that version as some developers are seeing issues with high memory usage for linking. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
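One way to sanity-check the claimed memory reduction after the bump; a sketch assuming GNU time is installed at /usr/bin/time and that the build links with -fuse-ld=mold (the output name and object list are placeholders):
{noformat}
mold --version
# Peak RSS of the link step, to compare before vs. after the upgrade.
/usr/bin/time -v gcc -fuse-ld=mold -o my_binary objs/*.o 2>&1 | \
  grep -i 'maximum resident set size'
{noformat}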
[jira] [Commented] (IMPALA-12967) Testcase fails at test_migrated_table_field_id_resolution due to "Table does not exist"
[ https://issues.apache.org/jira/browse/IMPALA-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853224#comment-17853224 ] Joe McDonnell commented on IMPALA-12967: There is a separate symptom where this test fails with a Disk I/O error. It is probably somewhat related, so we need to decide whether to include that symptom here. See IMPALA-13144. > Testcase fails at test_migrated_table_field_id_resolution due to "Table does > not exist" > --- > > Key: IMPALA-12967 > URL: https://issues.apache.org/jira/browse/IMPALA-12967 > Project: IMPALA > Issue Type: Bug > Components: Backend >Reporter: Yida Wu >Assignee: Quanlong Huang >Priority: Major > Labels: broken-build > > Testcase test_migrated_table_field_id_resolution fails at exhaustive release > build with following messages: > *Regression* > {code:java} > query_test.test_iceberg.TestIcebergTable.test_migrated_table_field_id_resolution[protocol: > beeswax | exec_option: {'test_replan': 1, 'batch_size': 0, 'num_nodes': 0, > 'disable_codegen_rows_threshold': 0, 'disable_codegen': True, > 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | table_format: > parquet/none] (from pytest) > {code} > *Error Message* > {code:java} > query_test/test_iceberg.py:266: in test_migrated_table_field_id_resolution > "iceberg_migrated_alter_test_orc", "orc") common/file_utils.py:68: in > create_iceberg_table_from_directory file_format)) > common/impala_connection.py:215: in execute > fetch_profile_after_close=fetch_profile_after_close) > beeswax/impala_beeswax.py:191: in execute handle = > self.__execute_query(query_string.strip(), user=user) > beeswax/impala_beeswax.py:384: in __execute_query > self.wait_for_finished(handle) beeswax/impala_beeswax.py:405: in > wait_for_finished raise ImpalaBeeswaxException("Query aborted:" + > error_log, None) E ImpalaBeeswaxException: ImpalaBeeswaxException: E > Query aborted:ImpalaRuntimeException: Error making 'createTable' RPC to Hive > Metastore: E CAUSED BY: IcebergTableLoadingException: Table does not exist > at location: > hdfs://localhost:20500/test-warehouse/iceberg_migrated_alter_test_orc > Stacktrace > query_test/test_iceberg.py:266: in test_migrated_table_field_id_resolution > "iceberg_migrated_alter_test_orc", "orc") > common/file_utils.py:68: in create_iceberg_table_from_directory > file_format)) > common/impala_connection.py:215: in execute > fetch_profile_after_close=fetch_profile_after_close) > beeswax/impala_beeswax.py:191: in execute > handle = self.__execute_query(query_string.strip(), user=user) > beeswax/impala_beeswax.py:384: in __execute_query > self.wait_for_finished(handle) > beeswax/impala_beeswax.py:405: in wait_for_finished > raise ImpalaBeeswaxException("Query aborted:" + error_log, None) > E ImpalaBeeswaxException: ImpalaBeeswaxException: > EQuery aborted:ImpalaRuntimeException: Error making 'createTable' RPC to > Hive Metastore: > E CAUSED BY: IcebergTableLoadingException: Table does not exist at > location: > hdfs://localhost:20500/test-warehouse/iceberg_migrated_alter_test_orc > {code} > *Standard Error* > {code:java} > SET > client_identifier=query_test/test_iceberg.py::TestIcebergTable::()::test_migrated_table_field_id_resolution[protocol:beeswax|exec_option:{'test_replan':1;'batch_size':0;'num_nodes':0;'disable_codegen_rows_threshold':0;'disable_codegen':True;'abort_on_error':1;'exec_single_; > SET sync_ddl=False; > -- executing against localhost:21000 > DROP DATABASE IF EXISTS `test_migrated_table_field_id_resolution_b59d79db` > CASCADE; > -- 2024-04-02 
00:56:55,137 INFO MainThread: Started query > f34399a8b7cddd67:031a3b96 > SET > client_identifier=query_test/test_iceberg.py::TestIcebergTable::()::test_migrated_table_field_id_resolution[protocol:beeswax|exec_option:{'test_replan':1;'batch_size':0;'num_nodes':0;'disable_codegen_rows_threshold':0;'disable_codegen':True;'abort_on_error':1;'exec_single_; > SET sync_ddl=False; > -- executing against localhost:21000 > CREATE DATABASE `test_migrated_table_field_id_resolution_b59d79db`; > -- 2024-04-02 00:56:57,302 INFO MainThread: Started query > 94465af69907eac5:e33f17e0 > -- 2024-04-02 00:56:57,353 INFO MainThread: Created database > "test_migrated_table_field_id_resolution_b59d79db" for test ID > "query_test/test_iceberg.py::TestIcebergTable::()::test_migrated_table_field_id_resolution[protocol: > beeswax | exec_option: {'test_replan': 1, 'batch_size': 0, 'num_nodes': 0, > 'disable_codegen_rows_threshold': 0, 'disable_codegen': True, > 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | table_format: > parquet/none]" > Picked up
[jira] [Commented] (IMPALA-13144) TestIcebergTable.test_migrated_table_field_id_resolution fails with Disk I/O error
[ https://issues.apache.org/jira/browse/IMPALA-13144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853223#comment-17853223 ] Joe McDonnell commented on IMPALA-13144: We need to decide whether we want to track this with IMPALA-12967 (which was originally about "Table does not exist at location" on the same test) or keep it separate. > TestIcebergTable.test_migrated_table_field_id_resolution fails with Disk I/O > error > -- > > Key: IMPALA-13144 > URL: https://issues.apache.org/jira/browse/IMPALA-13144 > Project: IMPALA > Issue Type: Bug > Components: Backend >Affects Versions: Impala 4.5.0 >Reporter: Joe McDonnell >Priority: Critical > Labels: broken-build, flaky > > A couple test jobs hit a failure on > TestIcebergTable.test_migrated_table_field_id_resolution: > {noformat} > query_test/test_iceberg.py:270: in test_migrated_table_field_id_resolution > vector, unique_database) > common/impala_test_suite.py:725: in run_test_case > result = exec_fn(query, user=test_section.get('USER', '').strip() or None) > common/impala_test_suite.py:660: in __exec_in_impala > result = self.__execute_query(target_impalad_client, query, user=user) > common/impala_test_suite.py:1013: in __execute_query > return impalad_client.execute(query, user=user) > common/impala_connection.py:216: in execute > fetch_profile_after_close=fetch_profile_after_close) > beeswax/impala_beeswax.py:191: in execute > handle = self.__execute_query(query_string.strip(), user=user) > beeswax/impala_beeswax.py:384: in __execute_query > self.wait_for_finished(handle) > beeswax/impala_beeswax.py:405: in wait_for_finished > raise ImpalaBeeswaxException("Query aborted:" + error_log, None) > E ImpalaBeeswaxException: ImpalaBeeswaxException: > EQuery aborted:Disk I/O error on > impala-ec2-centos79-m6i-4xlarge-xldisk-153e.vpc.cloudera.com:27000: Failed to > open HDFS file > hdfs://localhost:20500/test-warehouse/iceberg_migrated_alter_test/00_0 > E Error(2): No such file or directory > E Root cause: RemoteException: File does not exist: > /test-warehouse/iceberg_migrated_alter_test/00_0 > E at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:87) > E at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:77) > E at > org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:159) > E at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2040) > E at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:738) > E at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:454) > E at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > E at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533) > E at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > E at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994) > E at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922) > E at java.security.AccessController.doPrivileged(Native Method) > E at javax.security.auth.Subject.doAs(Subject.java:422) > E at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) > E at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899){noformat} -- This message was sent by Atlassian Jira 
(v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13144) TestIcebergTable.test_migrated_table_field_id_resolution fails with Disk I/O error
Joe McDonnell created IMPALA-13144: -- Summary: TestIcebergTable.test_migrated_table_field_id_resolution fails with Disk I/O error Key: IMPALA-13144 URL: https://issues.apache.org/jira/browse/IMPALA-13144 Project: IMPALA Issue Type: Bug Components: Backend Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell A couple test jobs hit a failure on TestIcebergTable.test_migrated_table_field_id_resolution: {noformat} query_test/test_iceberg.py:270: in test_migrated_table_field_id_resolution vector, unique_database) common/impala_test_suite.py:725: in run_test_case result = exec_fn(query, user=test_section.get('USER', '').strip() or None) common/impala_test_suite.py:660: in __exec_in_impala result = self.__execute_query(target_impalad_client, query, user=user) common/impala_test_suite.py:1013: in __execute_query return impalad_client.execute(query, user=user) common/impala_connection.py:216: in execute fetch_profile_after_close=fetch_profile_after_close) beeswax/impala_beeswax.py:191: in execute handle = self.__execute_query(query_string.strip(), user=user) beeswax/impala_beeswax.py:384: in __execute_query self.wait_for_finished(handle) beeswax/impala_beeswax.py:405: in wait_for_finished raise ImpalaBeeswaxException("Query aborted:" + error_log, None) E ImpalaBeeswaxException: ImpalaBeeswaxException: EQuery aborted:Disk I/O error on impala-ec2-centos79-m6i-4xlarge-xldisk-153e.vpc.cloudera.com:27000: Failed to open HDFS file hdfs://localhost:20500/test-warehouse/iceberg_migrated_alter_test/00_0 E Error(2): No such file or directory E Root cause: RemoteException: File does not exist: /test-warehouse/iceberg_migrated_alter_test/00_0 E at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:87) E at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:77) E at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:159) E at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2040) E at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:738) E at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:454) E at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) E at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533) E at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) E at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994) E at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922) E at java.security.AccessController.doPrivileged(Native Method) E at javax.security.auth.Subject.doAs(Subject.java:422) E at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) E at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899){noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13143) TestCatalogdHA.test_catalogd_failover_with_sync_ddl times out expecting query failure
Joe McDonnell created IMPALA-13143: -- Summary: TestCatalogdHA.test_catalogd_failover_with_sync_ddl times out expecting query failure Key: IMPALA-13143 URL: https://issues.apache.org/jira/browse/IMPALA-13143 Project: IMPALA Issue Type: Bug Components: Backend Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell The new TestCatalogdHA.test_catalogd_failover_with_sync_ddl test is failing intermittently with: {noformat} custom_cluster/test_catalogd_ha.py:472: in test_catalogd_failover_with_sync_ddl self.wait_for_state(handle, QueryState.EXCEPTION, 30, client=client) common/impala_test_suite.py:1216: in wait_for_state self.wait_for_any_state(handle, [expected_state], timeout, client) common/impala_test_suite.py:1234: in wait_for_any_state raise Timeout(timeout_msg) E Timeout: query '9d49ab6360f6cbc5:4826a796' did not reach one of the expected states [5], last known state 4{noformat} This means the query succeeded even though we expected it to fail. This is currently limited to s3 jobs. In a different test, we saw issues because s3 is slower (see IMPALA-12616). This test was introduced by IMPALA-13134: https://github.com/apache/impala/commit/70b7b6a78d49c30933d79e0a1c2a725f7e0a3e50 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (IMPALA-12616) test_restart_catalogd_while_handling_rpc_response* tests fail not reaching expected states
[ https://issues.apache.org/jira/browse/IMPALA-12616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell resolved IMPALA-12616. Fix Version/s: Impala 4.5.0 Resolution: Fixed I think the s3 slowness version of this is fixed, so I'm going to resolve this. > test_restart_catalogd_while_handling_rpc_response* tests fail not reaching > expected states > -- > > Key: IMPALA-12616 > URL: https://issues.apache.org/jira/browse/IMPALA-12616 > Project: IMPALA > Issue Type: Bug > Components: Backend >Affects Versions: Impala 1.4.2 >Reporter: Andrew Sherman >Assignee: Daniel Becker >Priority: Critical > Fix For: Impala 4.5.0 > > > There are failures in both > custom_cluster.test_restart_services.TestRestart.test_restart_catalogd_while_handling_rpc_response_with_timeout > and > custom_cluster.test_restart_services.TestRestart.test_restart_catalogd_while_handling_rpc_response_with_max_iters, > both look the same: > {code:java} > custom_cluster/test_restart_services.py:232: in > test_restart_catalogd_while_handling_rpc_response_with_timeout > self.wait_for_state(handle, self.client.QUERY_STATES["FINISHED"], > max_wait_time) > common/impala_test_suite.py:1181: in wait_for_state > self.wait_for_any_state(handle, [expected_state], timeout, client) > common/impala_test_suite.py:1199: in wait_for_any_state > raise Timeout(timeout_msg) > E Timeout: query '6a4e0bad9b511ccf:bf93de68' did not reach one of > the expected states [4], last known state 5 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IMPALA-13139) Query options set via ImpalaTestSuite::execute_query_expect_success stay set for subsequent queries
[ https://issues.apache.org/jira/browse/IMPALA-13139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell updated IMPALA-13139: --- Description: When debugging TestRestart, I noticed that the debug_action set for one query stayed in effect for subsequent queries that didn't specify query_options. {noformat} DEBUG_ACTION = ("WAIT_BEFORE_PROCESSING_CATALOG_UPDATE:SLEEP@{}" .format(debug_action_sleep_time_sec * 1000)) query = "alter table {} add columns (age int)".format(tbl_name) handle = self.execute_query_async(query, query_options={"debug_action": DEBUG_ACTION}) ... # debug_action is still set for these queries: self.execute_query_expect_success(self.client, "select age from {}".format(tbl_name)) self.execute_query_expect_success(self.client, "alter table {} add columns (name string)".format(tbl_name)) self.execute_query_expect_success(self.client, "select name from {}".format(tbl_name)){noformat} There is a way to clear the query options (self.client.clear_configuration()), but this is an odd behavior. It's unclear if some tests rely on this behavior. > Query options set via ImpalaTestSuite::execute_query_expect_success stay set > for subsequent queries > --- > > Key: IMPALA-13139 > URL: https://issues.apache.org/jira/browse/IMPALA-13139 > Project: IMPALA > Issue Type: Task > Components: Infrastructure >Affects Versions: Impala 4.5.0 >Reporter: Joe McDonnell >Priority: Major > > When debugging TestRestart, I noticed that the debug_action set for one query > stayed in effect for subsequent queries that didn't specify query_options. > {noformat} > DEBUG_ACTION = ("WAIT_BEFORE_PROCESSING_CATALOG_UPDATE:SLEEP@{}" > .format(debug_action_sleep_time_sec * 1000)) > query = "alter table {} add columns (age int)".format(tbl_name) > handle = self.execute_query_async(query, query_options={"debug_action": > DEBUG_ACTION}) > ... > # debug_action is still set for these queries: > self.execute_query_expect_success(self.client, "select age from > {}".format(tbl_name)) > self.execute_query_expect_success(self.client, > "alter table {} add columns (name string)".format(tbl_name)) > self.execute_query_expect_success(self.client, "select name from > {}".format(tbl_name)){noformat} > There is a way to clear the query options > (self.client.clear_configuration()), but this is an odd behavior. It's > unclear if some tests rely on this behavior. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13139) Query options set via ImpalaTestSuite::execute_query_expect_success stay set for subsequent queries
Joe McDonnell created IMPALA-13139: -- Summary: Query options set via ImpalaTestSuite::execute_query_expect_success stay set for subsequent queries Key: IMPALA-13139 URL: https://issues.apache.org/jira/browse/IMPALA-13139 Project: IMPALA Issue Type: Task Components: Infrastructure Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (IMPALA-12616) test_restart_catalogd_while_handling_rpc_response* tests fail not reaching expected states
[ https://issues.apache.org/jira/browse/IMPALA-12616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852961#comment-17852961 ] Joe McDonnell commented on IMPALA-12616: This is looking timing-related. I was able to get this to pass by adjusting some of the sleep times. Basically, it looks like the catalog is slower on s3 and some operations don't finish in the time we thought they would. {noformat} debug_action_sleep_time_sec = 10 (NEW: 30) DEBUG_ACTION = ("WAIT_BEFORE_PROCESSING_CATALOG_UPDATE:SLEEP@{}" .format(debug_action_sleep_time_sec * 1000)) query = "alter table {} add columns (age int)".format(tbl_name) handle = self.execute_query_async(query, query_options={"debug_action": DEBUG_ACTION}) # Wait a bit so the RPC from the catalogd arrives to the coordinator. time.sleep(0.5) (NEW: 5) self.cluster.catalogd.restart() # Wait for the query to finish. max_wait_time = (debug_action_sleep_time_sec + self.WAIT_FOR_CATALOG_UPDATE_TIMEOUT_SEC + 10) self.wait_for_state(handle, self.client.QUERY_STATES["FINISHED"], max_wait_time){noformat} A successful timeline looks like this: # Submit an alter table that sleeps before processing the catalog update # Sleep a little bit so the catalog knows about the alter table # Restart the catalogd # The catalog sends an update via the statestore. This has the new catalog ID and causes this message: "There was an error processing the impalad catalog update. Requesting a full topic update to recover: CatalogException: Detected catalog service ID changes from 9c9f7ff13f0e4f72:a896bee4d52fd37e to da67610b2c304198:a05daf1bc3d6a4b3. Aborting updateCatalog()" # The catalogd sends a full topic update # The alter table wakes up and prints this message: Catalog service ID mismatch. Current ID: da67610b2c304198:a05daf1bc3d6a4b3. ID in response: 9c9f7ff13f0e4f72:a896bee4d52fd37e. Catalogd may have been restarted. Waiting for new catalog update from statestore. # Either it times out or there are too many non-empty updates, and the alter table bails out with "W0506 22:42:10.316627 23066 impala-server.cc:2369] e14b23a22458ab75:6b269414] Ignoring catalog update result of catalog service ID 9c9f7ff13f0e4f72:a896bee4d52fd37e because it does not match with current catalog service ID da67610b2c304198:a05daf1bc3d6a4b3. The current catalog service ID may be stale (this may be caused by the catalogd having been restarted more than once) or newer than the catalog service ID of the update result." If the alter table wakes up from its sleep before #5 happens, the alter table will see the catalog service ID change and fail. To avoid that, we adjust the WAIT_BEFORE_PROCESSING_CATALOG_UPDATE higher. I also lengthened the sleep in #2 to give the initial catalog some extra time to hear about the alter table. The test verifies that the logs contain the expected messages, so this should be a safe modification to the test. 
> test_restart_catalogd_while_handling_rpc_response* tests fail not reaching > expected states > -- > > Key: IMPALA-12616 > URL: https://issues.apache.org/jira/browse/IMPALA-12616 > Project: IMPALA > Issue Type: Bug > Components: Backend >Affects Versions: Impala 1.4.2 >Reporter: Andrew Sherman >Assignee: Daniel Becker >Priority: Critical > > There are failures in both > custom_cluster.test_restart_services.TestRestart.test_restart_catalogd_while_handling_rpc_response_with_timeout > and > custom_cluster.test_restart_services.TestRestart.test_restart_catalogd_while_handling_rpc_response_with_max_iters, > both look the same: > {code:java} > custom_cluster/test_restart_services.py:232: in > test_restart_catalogd_while_handling_rpc_response_with_timeout > self.wait_for_state(handle, self.client.QUERY_STATES["FINISHED"], > max_wait_time) > common/impala_test_suite.py:1181: in wait_for_state > self.wait_for_any_state(handle, [expected_state], timeout, client) > common/impala_test_suite.py:1199: in wait_for_any_state > raise Timeout(timeout_msg) > E Timeout: query '6a4e0bad9b511ccf:bf93de68' did not reach one of > the expected states [4], last known state 5 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13132) Ozone jobs see intermittent termination of Ozone manager / HMS fails to start
Joe McDonnell created IMPALA-13132: -- Summary: Ozone jobs see intermittent termination of Ozone manager / HMS fails to start Key: IMPALA-13132 URL: https://issues.apache.org/jira/browse/IMPALA-13132 Project: IMPALA Issue Type: Bug Components: Infrastructure Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell Ozone jobs load data/metadata snapshots during dataload, then restart the cluster. On this restart, the HMS sometimes fails to come up: {noformat} 16:04:13 --> Starting Hive Metastore Service 16:04:13 No handlers could be found for logger "thrift.transport.TSocket" 16:04:14 Waiting for the Metastore at localhost:9083... ... 16:09:14 Waiting for the Metastore at localhost:9083... 16:09:14 Metastore service failed to start within 300.0 seconds.{noformat} In the metastore logs, we see messages like this: {noformat} 2024-06-04T08:37:06,425 INFO [main] retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From hostname/127.0.0.1 to localhost:9862 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.submitRequest over nodeId=null,nodeAddress=localhost:9862 after 1 failover attempts. Trying to failover after sleeping for 4000ms.{noformat} It's trying to talk to the Ozone manager. The Ozone cluster was back up and running before trying to start the HMS, but then the Ozone manager received a signal and shut down: {noformat} 24/06/04 08:36:37 ERROR om.OzoneManagerStarter: RECEIVED SIGNAL 15: SIGTERM 24/06/04 08:36:37 INFO om.OzoneManagerStarter: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down OzoneManager at hostname/127.0.0.1 / 24/06/04 08:36:37 INFO om.OzoneManager: om1[localhost:9862]: Stopping Ozone Manager{noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (IMPALA-12616) test_restart_catalogd_while_handling_rpc_response* tests fail not reaching expected states
[ https://issues.apache.org/jira/browse/IMPALA-12616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851904#comment-17851904 ] Joe McDonnell commented on IMPALA-12616: I switched the code to use self.client.wait_for_finished_timeout(), which will stop if it reaches either FINISHED or EXCEPTION. Here is the error it hits: {noformat} custom_cluster/test_restart_services.py:238: in test_restart_catalogd_while_handling_rpc_response_with_timeout finished = self.client.wait_for_finished_timeout(handle, max_wait_time) common/impala_connection.py:247: in wait_for_finished_timeout operation_handle.get_handle(), timeout) beeswax/impala_beeswax.py:423: in wait_for_finished_timeout raise ImpalaBeeswaxException("Query aborted:" + error_log, None) E ImpalaBeeswaxException: ImpalaBeeswaxException: EQuery aborted:CatalogException: Detected catalog service ID changes from b0019607521f4f0a:8340b9882af1a856 to a4f8584219b34182:9b3cf9af859a0d54. Aborting updateCatalog(){noformat} > test_restart_catalogd_while_handling_rpc_response* tests fail not reaching > expected states > -- > > Key: IMPALA-12616 > URL: https://issues.apache.org/jira/browse/IMPALA-12616 > Project: IMPALA > Issue Type: Bug > Components: Backend >Affects Versions: Impala 1.4.2 >Reporter: Andrew Sherman >Assignee: Daniel Becker >Priority: Critical > > There are failures in both > custom_cluster.test_restart_services.TestRestart.test_restart_catalogd_while_handling_rpc_response_with_timeout > and > custom_cluster.test_restart_services.TestRestart.test_restart_catalogd_while_handling_rpc_response_with_max_iters, > both look the same: > {code:java} > custom_cluster/test_restart_services.py:232: in > test_restart_catalogd_while_handling_rpc_response_with_timeout > self.wait_for_state(handle, self.client.QUERY_STATES["FINISHED"], > max_wait_time) > common/impala_test_suite.py:1181: in wait_for_state > self.wait_for_any_state(handle, [expected_state], timeout, client) > common/impala_test_suite.py:1199: in wait_for_any_state > raise Timeout(timeout_msg) > E Timeout: query '6a4e0bad9b511ccf:bf93de68' did not reach one of > the expected states [4], last known state 5 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-12616) test_restart_catalogd_while_handling_rpc_response* tests fail not reaching expected states
[ https://issues.apache.org/jira/browse/IMPALA-12616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851840#comment-17851840 ] Joe McDonnell commented on IMPALA-12616: This is now failing pretty consistently on a variety of s3 jobs (but only s3 jobs). I think the first thing we could do is modify wait_for_any_state() to detect the terminal state (EXCEPTION) and print the error. In general, it would be good for wait_for_state() to know about terminal states. > test_restart_catalogd_while_handling_rpc_response* tests fail not reaching > expected states > -- > > Key: IMPALA-12616 > URL: https://issues.apache.org/jira/browse/IMPALA-12616 > Project: IMPALA > Issue Type: Bug > Components: Backend >Affects Versions: Impala 1.4.2 >Reporter: Andrew Sherman >Assignee: Daniel Becker >Priority: Critical > > There are failures in both > custom_cluster.test_restart_services.TestRestart.test_restart_catalogd_while_handling_rpc_response_with_timeout > and > custom_cluster.test_restart_services.TestRestart.test_restart_catalogd_while_handling_rpc_response_with_max_iters, > both look the same: > {code:java} > custom_cluster/test_restart_services.py:232: in > test_restart_catalogd_while_handling_rpc_response_with_timeout > self.wait_for_state(handle, self.client.QUERY_STATES["FINISHED"], > max_wait_time) > common/impala_test_suite.py:1181: in wait_for_state > self.wait_for_any_state(handle, [expected_state], timeout, client) > common/impala_test_suite.py:1199: in wait_for_any_state > raise Timeout(timeout_msg) > E Timeout: query '6a4e0bad9b511ccf:bf93de68' did not reach one of > the expected states [4], last known state 5 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-13128) disk-file-test hangs on ARM + UBSAN test jobs
[ https://issues.apache.org/jira/browse/IMPALA-13128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851831#comment-17851831 ] Joe McDonnell commented on IMPALA-13128: It looks intermittent, so adding "flaky" label > disk-file-test hangs on ARM + UBSAN test jobs > - > > Key: IMPALA-13128 > URL: https://issues.apache.org/jira/browse/IMPALA-13128 > Project: IMPALA > Issue Type: Bug > Components: Backend >Affects Versions: Impala 4.5.0 >Reporter: Joe McDonnell >Priority: Critical > Labels: broken-build, flaky > > The UBSAN ARM job (running on Redhat 8) has been hanging then timing out with > this being the last output: > {noformat} > 23:06:47 63/147 Test #63: disk-io-mgr-test . Passed > 43.42 sec > 23:07:30 Start 64: disk-file-test > 23:07:30 > 18:47:00 > 18:47:00 run-all-tests.sh TIMED OUT! {noformat} > This has happened multiple times, but it looks limited to ARM + UBSAN. The > jobs take stack traces, but only of the running impalads / HMS. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13128) disk-file-test hangs on ARM + UBSAN test jobs
Joe McDonnell created IMPALA-13128: -- Summary: disk-file-test hangs on ARM + UBSAN test jobs Key: IMPALA-13128 URL: https://issues.apache.org/jira/browse/IMPALA-13128 Project: IMPALA Issue Type: Bug Components: Backend Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell The UBSAN ARM job (running on Redhat 8) has been hanging then timing out with this being the last output: {noformat} 23:06:47 63/147 Test #63: disk-io-mgr-test . Passed 43.42 sec 23:07:30 Start 64: disk-file-test 23:07:30 18:47:00 18:47:00 run-all-tests.sh TIMED OUT! {noformat} This has happened multiple times, but it looks limited to ARM + UBSAN. The jobs take stack traces, but only of the running impalads / HMS. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IMPALA-13128) disk-file-test hangs on ARM + UBSAN test jobs
[ https://issues.apache.org/jira/browse/IMPALA-13128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell updated IMPALA-13128: --- Labels: broken-build flaky (was: broken-build) > disk-file-test hangs on ARM + UBSAN test jobs > - > > Key: IMPALA-13128 > URL: https://issues.apache.org/jira/browse/IMPALA-13128 > Project: IMPALA > Issue Type: Bug > Components: Backend >Affects Versions: Impala 4.5.0 >Reporter: Joe McDonnell >Priority: Critical > Labels: broken-build, flaky > > The UBSAN ARM job (running on Redhat 8) has been hanging then timing out with > this being the last output: > {noformat} > 23:06:47 63/147 Test #63: disk-io-mgr-test . Passed > 43.42 sec > 23:07:30 Start 64: disk-file-test > 23:07:30 > 18:47:00 > 18:47:00 run-all-tests.sh TIMED OUT! {noformat} > This has happened multiple times, but it looks limited to ARM + UBSAN. The > jobs take stack traces, but only of the running impalads / HMS. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Resolved] (IMPALA-13127) custom_cluster/test_runtime_filter_aggregation.py is failing on ASAN jobs
[ https://issues.apache.org/jira/browse/IMPALA-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell resolved IMPALA-13127. Fix Version/s: Not Applicable Resolution: Duplicate Fixed by followup change in IMPALA-13040, closing as duplicate. > custom_cluster/test_runtime_filter_aggregation.py is failing on ASAN jobs > - > > Key: IMPALA-13127 > URL: https://issues.apache.org/jira/browse/IMPALA-13127 > Project: IMPALA > Issue Type: Bug > Components: Backend >Affects Versions: Impala 4.5.0 >Reporter: Joe McDonnell >Priority: Critical > Labels: broken-build, flaky > Fix For: Not Applicable > > > ASAN jobs have been intermittently hitting a failure in > custom_cluster.test_runtime_filter_aggregation.TestLateQueryStateInit.test_late_query_state_init(): > {noformat} > custom_cluster/test_runtime_filter_aggregation.py:129: in > test_late_query_state_init > self.assert_log_contains('impalad_node1', 'INFO', log_pattern, expected) > common/impala_test_suite.py:1383: in assert_log_contains > ", but found none." % (log_file_path, line_regex) > E AssertionError: Expected at least one line in file > /data0/jenkins/workspace/impala-cdwh-2024.0.18.0-core-asan-arm/repos/Impala/logs/custom_cluster_tests/impalad.impala-ec2-rhel88-m7g-4xlarge-ondemand-077e.vpc.cloudera.com.jenkins.log.INFO.20240603-025918.3562162 > matching regex 'UpdateFilterFromRemote RPC called with remaining wait time', > but found none.{noformat} > Seen on an ARM job and an x86_64 job, so it is probably not an architecture > specific thing. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (IMPALA-13127) custom_cluster/test_runtime_filter_aggregation.py is failing on ASAN jobs
Joe McDonnell created IMPALA-13127: -- Summary: custom_cluster/test_runtime_filter_aggregation.py is failing on ASAN jobs Key: IMPALA-13127 URL: https://issues.apache.org/jira/browse/IMPALA-13127 Project: IMPALA Issue Type: Bug Components: Backend Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell ASAN jobs have been intermittently hitting a failure in custom_cluster.test_runtime_filter_aggregation.TestLateQueryStateInit.test_late_query_state_init(): {noformat} custom_cluster/test_runtime_filter_aggregation.py:129: in test_late_query_state_init self.assert_log_contains('impalad_node1', 'INFO', log_pattern, expected) common/impala_test_suite.py:1383: in assert_log_contains ", but found none." % (log_file_path, line_regex) E AssertionError: Expected at least one line in file /data0/jenkins/workspace/impala-cdwh-2024.0.18.0-core-asan-arm/repos/Impala/logs/custom_cluster_tests/impalad.impala-ec2-rhel88-m7g-4xlarge-ondemand-077e.vpc.cloudera.com.jenkins.log.INFO.20240603-025918.3562162 matching regex 'UpdateFilterFromRemote RPC called with remaining wait time', but found none.{noformat} Seen on an ARM job and an x86_64 job, so it is probably not an architecture specific thing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
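[Editorial note] For readers unfamiliar with the failing assertion, a simplified sketch of what an assert_log_contains-style check does; the real helper in common/impala_test_suite.py also resolves the daemon's current log file and handles expected counts differently:
{code:python}
import re

def assert_log_contains(log_file_path, line_regex, expected_count=1):
    # Count lines matching the regex and fail with the same style of
    # message seen in the traceback above.
    pattern = re.compile(line_regex)
    with open(log_file_path) as log_file:
        found = sum(1 for line in log_file if pattern.search(line))
    assert found >= expected_count, (
        "Expected at least %d line(s) in file %s matching regex '%s', "
        "but found %d." % (expected_count, log_file_path, line_regex, found))
{code}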
[jira] [Created] (IMPALA-13125) Set of tests for exploration_strategy=exhaustive varies between python 2 and 3
Joe McDonnell created IMPALA-13125: -- Summary: Set of tests for exploration_strategy=exhaustive varies between python 2 and 3 Key: IMPALA-13125 URL: https://issues.apache.org/jira/browse/IMPALA-13125 Project: IMPALA Issue Type: Sub-task Components: Infrastructure Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell TLDR: Python 3 runs a different set of exhaustive tests than Python 2. Longer version: When looking into running Python 3 tests, I noticed that the set of tests running for the exhaustive tests is different for Python 2 vs Python 3. This was surprising. It turns out there is a distinction between run-tests.py's --exploration_strategy=exhaustive vs the --workload_exploration_strategy="functional-query:exhaustive" option. The exhaustive job is actually doing the latter. This means that individual functional-query workload classes see cls.exploration_strategy() == "exhaustive", but the logic that generates the test vector still sees exploration_strategy=core and it still uses pairwise generation. Code: {noformat} if exploration_strategy == 'exhaustive': return self.__generate_exhaustive_combinations() elif exploration_strategy in ['core', 'pairwise']: return self.__generate_pairwise_combinations(){noformat} [https://github.com/apache/impala/blob/master/tests/common/test_vector.py#L165-L168] Python 2 vs 3 changes the way dictionaries work, which affects the order of test dimensions and therefore which tests get picked. So, the Python 3 exhaustive tests are different. This may expose latent bugs, because some combinations that meet the constraints are never actually run (e.g. some json encodings don't have the decimal_tiny table). We can work to make them behave similarly, using pytest's --collect-only option to look at the differences (and compare them to actual existing runs). -- This message was sent by Atlassian Jira (v8.20.10#820010)
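[Editorial note] A sketch of the --collect-only comparison suggested at the end of the issue above; the wrapper script names are assumptions, and older pytest versions may format the collected output differently:
{code:python}
import subprocess

def collect_test_ids(cmd):
    # 'pytest --collect-only -q' prints one test id per line in recent
    # pytest versions.
    out = subprocess.run(cmd + ["--collect-only", "-q"],
                         capture_output=True, text=True).stdout
    return set(line for line in out.splitlines() if "::" in line)

py2_ids = collect_test_ids(["bin/impala-py.test"])
py3_ids = collect_test_ids(["bin/impala-py.test-3"])  # hypothetical py3 wrapper
print("only under python 2:", sorted(py2_ids - py3_ids))
print("only under python 3:", sorted(py3_ids - py2_ids))
{code}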
[jira] [Created] (IMPALA-13124) Migrate tests that use the 'unittest' package to use normal pytest base class
Joe McDonnell created IMPALA-13124: -- Summary: Migrate tests that use the 'unittest' package to use normal pytest base class Key: IMPALA-13124 URL: https://issues.apache.org/jira/browse/IMPALA-13124 Project: IMPALA Issue Type: Sub-task Components: Infrastructure Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell Assignee: Joe McDonnell Some tests use the 'unittest' package as the base class for their tests. These can be run by pytest, but when running the tests with python 3, they fail with this message: {noformat} ../infra/python/env-gcc10.4.0-py3/lib/python3.7/site-packages/_pytest/runner.py:150: in __init__ self.result = func() ../infra/python/env-gcc10.4.0-py3/lib/python3.7/site-packages/_pytest/main.py:435: in _memocollect return self._memoizedcall('_collected', lambda: list(self.collect())) ../infra/python/env-gcc10.4.0-py3/lib/python3.7/site-packages/_pytest/main.py:315: in _memoizedcall res = function() ../infra/python/env-gcc10.4.0-py3/lib/python3.7/site-packages/_pytest/main.py:435: in return self._memoizedcall('_collected', lambda: list(self.collect())) ../infra/python/env-gcc10.4.0-py3/lib/python3.7/site-packages/_pytest/python.py:605: in collect return super(Module, self).collect() ../infra/python/env-gcc10.4.0-py3/lib/python3.7/site-packages/_pytest/python.py:459: in collect res = self.makeitem(name, obj) ../infra/python/env-gcc10.4.0-py3/lib/python3.7/site-packages/_pytest/python.py:471: in makeitem collector=self, name=name, obj=obj) ../infra/python/env-gcc10.4.0-py3/lib/python3.7/site-packages/_pytest/vendored_packages/pluggy.py:724: in __call__ return self._hookexec(self, self._nonwrappers + self._wrappers, kwargs) ../infra/python/env-gcc10.4.0-py3/lib/python3.7/site-packages/_pytest/vendored_packages/pluggy.py:338: in _hookexec return self._inner_hookexec(hook, methods, kwargs) ../infra/python/env-gcc10.4.0-py3/lib/python3.7/site-packages/_pytest/vendored_packages/pluggy.py:333: in _MultiCall(methods, kwargs, hook.spec_opts).execute() ../infra/python/env-gcc10.4.0-py3/lib/python3.7/site-packages/_pytest/vendored_packages/pluggy.py:595: in execute return _wrapped_call(hook_impl.function(*args), self.execute) ../infra/python/env-gcc10.4.0-py3/lib/python3.7/site-packages/_pytest/vendored_packages/pluggy.py:249: in _wrapped_call wrap_controller.send(call_outcome) E RuntimeError: generator raised StopIteration{noformat} Converting them to use the regular pytest base classes works fine with python 3 (and also python 2). -- This message was sent by Atlassian Jira (v8.20.10#820010)
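[Editorial note] To make the migration concrete, a before/after sketch with an invented test; the point is only that pytest collects plain Test* classes without any unittest base class:
{code:python}
# Before (py3 collection fails as in the traceback above):
#
#   import unittest
#
#   class TestParsing(unittest.TestCase):
#       def test_roundtrip(self):
#           self.assertEqual(parse("a=1"), {"a": "1"})
#
# After: a plain pytest-style class with bare asserts.
def parse(s):
    key, value = s.split("=", 1)
    return {key: value}

class TestParsing(object):
    def test_roundtrip(self):
        assert parse("a=1") == {"a": "1"}
{code}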
[jira] [Created] (IMPALA-13123) Add a way to run tests with python 3
Joe McDonnell created IMPALA-13123: -- Summary: Add a way to run tests with python 3 Key: IMPALA-13123 URL: https://issues.apache.org/jira/browse/IMPALA-13123 Project: IMPALA Issue Type: Sub-task Components: Infrastructure Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell Assignee: Joe McDonnell As a first step towards switching to python 3, we need an option to run the tests using the toolchain python 3. For example, there could be an environment variable that tells tests/run-tests.py and bin/impala-py.test to use python 3. This can be combined with a first round of fixes to get a decent number of tests running and see what is broken. The fixes must be compatible with python 2, and the default will still be python 2. -- This message was sent by Atlassian Jira (v8.20.10#820010)
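[Editorial note] One possible shape for the switch, as a sketch only; the variable name IMPALA_USE_PYTHON3 and the toolchain python layout are assumptions, not the final design:
{code:python}
import os

def test_python_executable():
    # Default stays python 2; opt in to the toolchain python 3 via an
    # environment variable (name and path invented for illustration).
    if os.environ.get("IMPALA_USE_PYTHON3") == "true":
        return os.path.join(os.environ["IMPALA_TOOLCHAIN_PACKAGES_HOME"],
                            "python-3.7.16", "bin", "python3")
    return "python2"
{code}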
[jira] [Assigned] (IMPALA-12686) Build the toolchain with basic debug information (-g1)
[ https://issues.apache.org/jira/browse/IMPALA-12686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell reassigned IMPALA-12686: -- Assignee: Joe McDonnell > Build the toolchain with basic debug information (-g1) > -- > > Key: IMPALA-12686 > URL: https://issues.apache.org/jira/browse/IMPALA-12686 > Project: IMPALA > Issue Type: Improvement > Components: Infrastructure >Affects Versions: Impala 4.4.0 >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Major > > Currently, we build most of the toolchain without debug information and > without "-fno-omit-frame-pointer". This makes it difficult to get reliable > stack traces that go through some of those libraries. We should build the > toolchain with basic debug information (-g1) to get reliable stack traces. > For some libraries, we want to compile with full debug information (-g) to > allow stepping through the code with a debugger. Currently, ORC > and Kudu (and others) are built with -g and should stay that way. We should > add -g for Thrift. > To save space, we should also enable compressed debug information (-gz) to > keep the sizes from growing too much (and reduce the size of existing debug > information). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Resolved] (IMPALA-13057) Incorporate tuple/slot information into the tuple cache key
[ https://issues.apache.org/jira/browse/IMPALA-13057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell resolved IMPALA-13057. Fix Version/s: Impala 4.5.0 Resolution: Fixed > Incorporate tuple/slot information into the tuple cache key > --- > > Key: IMPALA-13057 > URL: https://issues.apache.org/jira/browse/IMPALA-13057 > Project: IMPALA > Issue Type: Bug > Components: Frontend >Affects Versions: Impala 4.4.0 >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Major > Fix For: Impala 4.5.0 > > > Since the tuple and slot information is kept separately in the descriptor > table, it does not get incorporated into the PlanNode thrift used for the > tuple cache key. This means that the tuple cache can't distinguish between > these two queries: > {noformat} > select int_col1 from table; > select int_col2 from table;{noformat} > To solve this, the tuple/slot information needs to be incorporated into the > cache key. PlanNode::initThrift() walks through each tuple, so this is a good > place to serialize the TupleDescriptor/SlotDescriptors and incorporate it > into the hash. > The tuple ids and slot ids are global ids, so the value is influenced by the > entirety of the query. This is a problem for matching cache results across > different queries. As part of incorporating the tuple/slot information, we > should also add an ability to translate tuple/slot ids into ids local to a > subtree. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (IMPALA-13072) Toolchain: Add retries for uploading artifacts to the s3 buckets
[ https://issues.apache.org/jira/browse/IMPALA-13072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell resolved IMPALA-13072. Fix Version/s: Impala 4.5.0 Resolution: Fixed > Toolchain: Add retries for uploading artifacts to the s3 buckets > > > Key: IMPALA-13072 > URL: https://issues.apache.org/jira/browse/IMPALA-13072 > Project: IMPALA > Issue Type: Improvement > Components: Infrastructure >Affects Versions: Impala 4.5.0 >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Major > Fix For: Impala 4.5.0 > > > On ARM toolchain builds, we have seen some failures to upload tarballs to s3: > {noformat} > 22:17:06 impala-toolchain-redhat8: Uploading > /mnt/build/llvm-5.0.1-asserts-p7-gcc-10.4.0.tar.gz to > s3://native-toolchain/build/33-f93e2c9a86/llvm/5.0.1-asserts-p7-gcc-10.4.0/llvm-5.0.1-asserts-p7-gcc-10.4.0-ec2-package-centos-8-aarch64.tar.gz > 22:17:06 impala-toolchain-redhat8: /mnt/functions.sh: line 385: 680012 > Segmentation fault (core dumped) aws s3 cp --only-show-errors > "${PACKAGE_FINAL_TGZ}" "${PACKAGE_S3_DESTINATION}"{noformat} > Since we do many uploads, even a relatively low failure rate can make it hard > to get a passing build. We should change the code to retry the upload. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Assigned] (IMPALA-13073) Toolchain builds should pass VERBOSE=1 into make
[ https://issues.apache.org/jira/browse/IMPALA-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell reassigned IMPALA-13073: -- Assignee: Joe McDonnell > Toolchain builds should pass VERBOSE=1 into make > > > Key: IMPALA-13073 > URL: https://issues.apache.org/jira/browse/IMPALA-13073 > Project: IMPALA > Issue Type: Improvement >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Major > > It is useful to be able to examine the compilation flags for toolchain > components. Sometimes we want to add -fno-omit-frame-pointer or add debug > symbols with -g1 and verify that they actually get set. For projects that use > CMake, the output often does not print the compile command. CMake can produce > a compilation database, but it is simpler to have make print the compilation > command by adding VERBOSE=1. The output isn't that big, and it gets > redirected to a file, so it seems like we could leave it on by default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-13072) Toolchain: Add retries for uploading artifacts to the s3 buckets
[ https://issues.apache.org/jira/browse/IMPALA-13072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851229#comment-17851229 ] Joe McDonnell commented on IMPALA-13072: Fixed by this commit: {noformat} commit f601ec33f2bcfaab19a46cff5fc6f0a90e22da8d Author: Joe McDonnell Date: Fri May 10 17:22:56 2024 -0700 IMPALA-13072: Add retries for s3 uploads to combat flakiness On ARM toolchain builds, we have seen some uploads to s3 fail with a segmentation fault. Given the number of artifacts that the toolchain uploads, even a relatively low error rate can make it hard to get a passing build. This modifies the s3 upload code to retry up to 10 times to avoid this flakiness. Testing: - Ran an ARM toolchain build and saw the retry happen successfully - Ran a toolchain build with an invalid s3 bucket and verified it failed after 10 retries Change-Id: I95d858c99e965730303c2bfd90478ac5f68acf83 Reviewed-on: http://gerrit.cloudera.org:8080/21421 Reviewed-by: Michael Smith Reviewed-by: Laszlo Gaal Tested-by: Joe McDonnell {noformat} > Toolchain: Add retries for uploading artifacts to the s3 buckets > > > Key: IMPALA-13072 > URL: https://issues.apache.org/jira/browse/IMPALA-13072 > Project: IMPALA > Issue Type: Improvement > Components: Infrastructure >Affects Versions: Impala 4.5.0 >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Major > > On ARM toolchain builds, we have seen some failures to upload tarballs to s3: > {noformat} > 22:17:06 impala-toolchain-redhat8: Uploading > /mnt/build/llvm-5.0.1-asserts-p7-gcc-10.4.0.tar.gz to > s3://native-toolchain/build/33-f93e2c9a86/llvm/5.0.1-asserts-p7-gcc-10.4.0/llvm-5.0.1-asserts-p7-gcc-10.4.0-ec2-package-centos-8-aarch64.tar.gz > 22:17:06 impala-toolchain-redhat8: /mnt/functions.sh: line 385: 680012 > Segmentation fault (core dumped) aws s3 cp --only-show-errors > "${PACKAGE_FINAL_TGZ}" "${PACKAGE_S3_DESTINATION}"{noformat} > Since we do many uploads, even a relatively low failure rate can make it hard > to get a passing build. We should change the code to retry the upload. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
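[Editorial note] The commit above retries the aws s3 cp invocation in shell; the same idea rendered as a Python sketch (bucket/key names invented, boto3 assumed available):
{code:python}
import time

import boto3
from botocore.exceptions import BotoCoreError, ClientError

def upload_with_retries(filename, bucket, key, attempts=10):
    s3 = boto3.client("s3")
    for attempt in range(1, attempts + 1):
        try:
            s3.upload_file(filename, bucket, key)
            return
        except (BotoCoreError, ClientError):
            if attempt == attempts:
                raise  # give up after the final attempt
            # Transient failures usually pass on retry; back off a bit.
            time.sleep(2 * attempt)
{code}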
[jira] [Resolved] (IMPALA-13111) impala-gdb.py's find-query-ids/find-fragment-instances return unusable query ids
[ https://issues.apache.org/jira/browse/IMPALA-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell resolved IMPALA-13111. Fix Version/s: Impala 4.5.0 Resolution: Fixed > impala-gdb.py's find-query-ids/find-fragment-instances return unusable query > ids > > > Key: IMPALA-13111 > URL: https://issues.apache.org/jira/browse/IMPALA-13111 > Project: IMPALA > Issue Type: Bug > Components: Infrastructure >Affects Versions: Impala 4.5.0 >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Major > Fix For: Impala 4.5.0 > > > The gdb helpers in lib/python/impala_py_lib/gdb/impala-gdb.py provide > information about the queries / fragments running in a core file. However, > the query/fragment ids that it returns have issues with the signedness of the > integers: > {noformat} > (gdb) find-fragment-instances > Fragment Instance Id Thread IDs > -23b76c1699a831a1:279358680036 [117120] > -23b76c1699a831a1:279358680037 [117121] > -23b76c1699a831a1:279358680038 [117122] > .. > (gdb) find-query-ids > -3cbda1606b3ade7c:f170c4bd > -23b76c1699a831a1:27935868 > 68435df1364aa90f:1752944f > 3442ed6354c7355d:78c83d20{noformat} > The low values for find-query-ids don't have this problem, because they are > ANDed with 0x: > {noformat} > qid_low = format(int(qid_low, 16) & 0x, > 'x'){noformat} > We can fix the other locations by ANDing with 0x. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
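[Editorial note] The mask constants in the Jira text above were mangled in transit and are left as-is; as a hedged illustration only (assuming 64-bit ids and a full-width mask), the signedness fix amounts to reinterpreting a negative Python int as unsigned before formatting:
{code:python}
def to_unsigned_hex(value, bits=64):
    # Masking with (1 << bits) - 1 reinterprets a negative
    # two's-complement value as its unsigned equivalent.
    return format(value & ((1 << bits) - 1), 'x')

# The negative fragment id from the report, rendered usable:
assert to_unsigned_hex(-0x23b76c1699a831a1) == 'dc4893e96657ce5f'
{code}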
[jira] [Assigned] (IMPALA-13121) Move the toolchain to a newer version of ccache
[ https://issues.apache.org/jira/browse/IMPALA-13121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell reassigned IMPALA-13121: -- Assignee: Joe McDonnell > Move the toolchain to a newer version of ccache > --- > > Key: IMPALA-13121 > URL: https://issues.apache.org/jira/browse/IMPALA-13121 > Project: IMPALA > Issue Type: Task > Components: Infrastructure >Affects Versions: Impala 4.5.0 >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Major > > The native-toolchain currently uses ccache 3.3.3. In a recent change adding > debug info, I ran into a case where the debug level was not what I expected. > I had added a -g0 at the end to turn off debug information for the cmake > build, but it still ended up with debug info. > The release notes for ccache 3.3.5 say this: > * Fixed a regression where the original order of debug options could be > lost. This reverts the “Improved parsing of {{-g*}} options” feature in > ccache 3.3. > [https://ccache.dev/releasenotes.html#_ccache_3_3_5] > I think I may have been hitting that. We should upgrade ccache to a more > recent version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Assigned] (IMPALA-13111) impala-gdb.py's find-query-ids/find-fragment-instances return unusable query ids
[ https://issues.apache.org/jira/browse/IMPALA-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell reassigned IMPALA-13111: -- Assignee: Joe McDonnell > impala-gdb.py's find-query-ids/find-fragment-instances return unusable query > ids > > > Key: IMPALA-13111 > URL: https://issues.apache.org/jira/browse/IMPALA-13111 > Project: IMPALA > Issue Type: Bug > Components: Infrastructure >Affects Versions: Impala 4.5.0 >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Major > > The gdb helpers in lib/python/impala_py_lib/gdb/impala-gdb.py provide > information about the queries / fragments running in a core file. However, > the query/fragment ids that it returns have issues with the signedness of the > integers: > {noformat} > (gdb) find-fragment-instances > Fragment Instance Id Thread IDs > -23b76c1699a831a1:279358680036 [117120] > -23b76c1699a831a1:279358680037 [117121] > -23b76c1699a831a1:279358680038 [117122] > .. > (gdb) find-query-ids > -3cbda1606b3ade7c:f170c4bd > -23b76c1699a831a1:27935868 > 68435df1364aa90f:1752944f > 3442ed6354c7355d:78c83d20{noformat} > The low values for find-query-ids don't have this problem, because they are > ANDed with 0x: > {noformat} > qid_low = format(int(qid_low, 16) & 0x, > 'x'){noformat} > We can fix the other locations by ANDing with 0x. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13121) Move the toolchain to a newer version of ccache
Joe McDonnell created IMPALA-13121: -- Summary: Move the toolchain to a newer version of ccache Key: IMPALA-13121 URL: https://issues.apache.org/jira/browse/IMPALA-13121 Project: IMPALA Issue Type: Task Components: Infrastructure Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell The native-toolchain currently uses ccache 3.3.3. In a recent change adding debug info, I ran into a case where the debug level was not what I expected. I had added a -g0 at the end to turn off debug information for the cmake build, but it still ended up with debug info. The release notes for ccache 3.3.5 say this: * Fixed a regression where the original order of debug options could be lost. This reverts the “Improved parsing of {{-g*}} options” feature in ccache 3.3. [https://ccache.dev/releasenotes.html#_ccache_3_3_5] I think I may have been hitting that. We should upgrade ccache to a more recent version. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (IMPALA-13111) impala-gdb.py's find-query-ids/find-fragment-instances return unusable query ids
Joe McDonnell created IMPALA-13111: -- Summary: impala-gdb.py's find-query-ids/find-fragment-instances return unusable query ids Key: IMPALA-13111 URL: https://issues.apache.org/jira/browse/IMPALA-13111 Project: IMPALA Issue Type: Bug Components: Infrastructure Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell The gdb helpers in lib/python/impala_py_lib/gdb/impala-gdb.py provide information about the queries / fragments running in a core file. However, the query/fragment ids that it returns have issues with the signedness of the integers: {noformat} (gdb) find-fragment-instances Fragment Instance Id Thread IDs -23b76c1699a831a1:279358680036 [117120] -23b76c1699a831a1:279358680037 [117121] -23b76c1699a831a1:279358680038 [117122] .. (gdb) find-query-ids -3cbda1606b3ade7c:f170c4bd -23b76c1699a831a1:27935868 68435df1364aa90f:1752944f 3442ed6354c7355d:78c83d20{noformat} The low values for find-query-ids don't have this problem, because they are ANDed with 0x: {noformat} qid_low = format(int(qid_low, 16) & 0x, 'x'){noformat} We can fix the other locations by ANDing with 0x. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (IMPALA-13020) catalog-topic updates >2GB do not work due to Thrift's max message size
[ https://issues.apache.org/jira/browse/IMPALA-13020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell resolved IMPALA-13020. Fix Version/s: Impala 4.5.0 Resolution: Fixed > catalog-topic updates >2GB do not work due to Thrift's max message size > --- > > Key: IMPALA-13020 > URL: https://issues.apache.org/jira/browse/IMPALA-13020 > Project: IMPALA > Issue Type: Bug > Components: Backend >Affects Versions: Impala 4.2.0, Impala 4.3.0 >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Critical > Fix For: Impala 4.5.0 > > > Thrift 0.16.0 added a max message size to protect against malicious packets > that can consume a large amount of memory on the receiver side. This max > message size is a signed 32-bit integer, so it maxes out at 2GB (which we set > via thrift_rpc_max_message_size). > In catalog v1, the catalog-update statestore topic can become larger than 2GB > when there are a large number of tables / partitions / files. If this happens > and an Impala coordinator needs to start up (or needs a full topic update for > any other reason), it is expecting the statestore to send it the full topic > update, but the coordinator actually can't process the message. The > deserialization of the message hits the 2GB max message size limit and fails. > On the statestore side, it shows this message: > {noformat} > I0418 16:54:51.727290 3844140 statestore.cc:507] Preparing initial > catalog-update topic update for > impa...@mcdonnellthrift.vpc.cloudera.com:27000. Size = 2.27 GB > I0418 16:54:53.889446 3844140 thrift-util.cc:198] TSocket::write_partial() > send() : Broken pipe > I0418 16:54:53.889488 3844140 client-cache.cc:82] ReopenClient(): re-creating > client for mcdonnellthrift.vpc.cloudera.com:23000 > I0418 16:54:53.889493 3844140 thrift-util.cc:198] TSocket::write_partial() > send() : Broken pipe > I0418 16:54:53.889503 3844140 thrift-client.cc:116] Error closing connection > to: mcdonnellthrift.vpc.cloudera.com:23000, ignoring (write() send(): Broken > pipe) > I0418 16:54:56.052882 3844140 thrift-util.cc:198] TSocket::write_partial() > send() : Broken pipe > I0418 16:54:56.052932 3844140 client-cache.h:363] RPC Error: Client for > mcdonnellthrift.vpc.cloudera.com:23000 hit an unexpected exception: write() > send(): Broken pipe, type: N6apache6thrift9transport19TTransportExceptionE, > rpc: N6impala20TUpdateStateResponseE, send: not done > I0418 16:54:56.052937 3844140 client-cache.cc:174] Broken Connection, destroy > client for mcdonnellthrift.vpc.cloudera.com:23000{noformat} > On the Impala side, it doesn't give a good error, but we see this: > {noformat} > I0418 16:54:53.889683 3214537 TAcceptQueueServer.cpp:355] New connection to > server StatestoreSubscriber from client > I0418 16:54:54.080694 3214136 Frontend.java:1837] Waiting for local catalog > to be initialized, attempt: 110 > I0418 16:54:56.080920 3214136 Frontend.java:1837] Waiting for local catalog > to be initialized, attempt: 111 > I0418 16:54:58.081131 3214136 Frontend.java:1837] Waiting for local catalog > to be initialized, attempt: 112 > I0418 16:55:00.081358 3214136 Frontend.java:1837] Waiting for local catalog > to be initialized, attempt: 113{noformat} > With a patched Thrift that allows an int64_t max message size and setting that > to a larger value, Impala was able to start up (even without restarting > the statestored). 
> Some clusters that upgrade to a newer version may hit this, as Thrift did not > previously enforce this limit, so this is something we should fix to avoid > upgrade issues. -- This message was sent by Atlassian Jira (v8.20.10#820010)
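[Editorial note] For the arithmetic behind the 2GB ceiling, a quick check (pure illustration):
{code:python}
# Thrift's max message size is a signed 32-bit int, so the largest
# expressible limit is 2^31 - 1 bytes.
INT32_MAX = 2**31 - 1
print(INT32_MAX)                          # 2147483647, just under 2 GiB
# The 2.27 GB topic update from the statestore log above exceeds it:
print(int(2.27 * 1024**3) > INT32_MAX)    # True
{code}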
[jira] [Assigned] (IMPALA-13020) catalog-topic updates >2GB do not work due to Thrift's max message size
[ https://issues.apache.org/jira/browse/IMPALA-13020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe McDonnell reassigned IMPALA-13020: -- Assignee: Joe McDonnell > catalog-topic updates >2GB do not work due to Thrift's max message size > --- > > Key: IMPALA-13020 > URL: https://issues.apache.org/jira/browse/IMPALA-13020 > Project: IMPALA > Issue Type: Bug > Components: Backend >Affects Versions: Impala 4.2.0, Impala 4.3.0 >Reporter: Joe McDonnell >Assignee: Joe McDonnell >Priority: Critical > > Thrift 0.16.0 added a max message size to protect against malicious packets > that can consume a large amount of memory on the receiver side. This max > message size is a signed 32-bit integer, so it maxes out at 2GB (which we set > via thrift_rpc_max_message_size). > In catalog v1, the catalog-update statestore topic can become larger than 2GB > when there are a large number of tables / partitions / files. If this happens > and an Impala coordinator needs to start up (or needs a full topic update for > any other reason), it is expecting the statestore to send it the full topic > update, but the coordinator actually can't process the message. The > deserialization of the message hits the 2GB max message size limit and fails. > On the statestore side, it shows this message: > {noformat} > I0418 16:54:51.727290 3844140 statestore.cc:507] Preparing initial > catalog-update topic update for > impa...@mcdonnellthrift.vpc.cloudera.com:27000. Size = 2.27 GB > I0418 16:54:53.889446 3844140 thrift-util.cc:198] TSocket::write_partial() > send() : Broken pipe > I0418 16:54:53.889488 3844140 client-cache.cc:82] ReopenClient(): re-creating > client for mcdonnellthrift.vpc.cloudera.com:23000 > I0418 16:54:53.889493 3844140 thrift-util.cc:198] TSocket::write_partial() > send() : Broken pipe > I0418 16:54:53.889503 3844140 thrift-client.cc:116] Error closing connection > to: mcdonnellthrift.vpc.cloudera.com:23000, ignoring (write() send(): Broken > pipe) > I0418 16:54:56.052882 3844140 thrift-util.cc:198] TSocket::write_partial() > send() : Broken pipe > I0418 16:54:56.052932 3844140 client-cache.h:363] RPC Error: Client for > mcdonnellthrift.vpc.cloudera.com:23000 hit an unexpected exception: write() > send(): Broken pipe, type: N6apache6thrift9transport19TTransportExceptionE, > rpc: N6impala20TUpdateStateResponseE, send: not done > I0418 16:54:56.052937 3844140 client-cache.cc:174] Broken Connection, destroy > client for mcdonnellthrift.vpc.cloudera.com:23000{noformat} > On the Impala side, it doesn't give a good error, but we see this: > {noformat} > I0418 16:54:53.889683 3214537 TAcceptQueueServer.cpp:355] New connection to > server StatestoreSubscriber from client > I0418 16:54:54.080694 3214136 Frontend.java:1837] Waiting for local catalog > to be initialized, attempt: 110 > I0418 16:54:56.080920 3214136 Frontend.java:1837] Waiting for local catalog > to be initialized, attempt: 111 > I0418 16:54:58.081131 3214136 Frontend.java:1837] Waiting for local catalog > to be initialized, attempt: 112 > I0418 16:55:00.081358 3214136 Frontend.java:1837] Waiting for local catalog > to be initialized, attempt: 113{noformat} > With a patched Thrift that allows an int64_t max message size and setting that > to a larger value, Impala was able to start up (even without restarting > the statestored). > Some clusters that upgrade to a newer version may hit this, as Thrift did not > previously enforce this limit, so this is something we should fix to avoid > upgrade issues. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13082) Use separate versions for jackson-databind vs jackson-core, etc.
Joe McDonnell created IMPALA-13082: -- Summary: Use separate versions for jackson-databind vs jackson-core, etc. Key: IMPALA-13082 URL: https://issues.apache.org/jira/browse/IMPALA-13082 Project: IMPALA Issue Type: Bug Components: Frontend Affects Versions: Impala 4.5.0 Reporter: Joe McDonnell We have a single jackson-databind.version property that is populated by IMPALA_JACKSON_DATABIND_VERSION. This currently sets the version for jackson-databind as well as other jackson libraries like jackson-core. Sometimes there is a jackson-databind patch release without a release of other jackson libraries. For example, there is a jackson-databind 2.12.7.1, but there is no jackson-core 2.12.7.1. There is only jackson-core 2.12.7. To handle these patch scenarios, it is useful to split out the jackson-databind version from the version for other jackson libraries. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (IMPALA-13073) Toolchain builds should pass VERBOSE=1 into make
Joe McDonnell created IMPALA-13073: -- Summary: Toolchain builds should pass VERBOSE=1 into make Key: IMPALA-13073 URL: https://issues.apache.org/jira/browse/IMPALA-13073 Project: IMPALA Issue Type: Improvement Reporter: Joe McDonnell It is useful to be able to examine the compilation flags for toolchain components. Sometimes we want to add -fno-omit-frame-pointer or add debug symbols with -g1 and verify that they actually get set. For projects that use CMake, the output often does not print the compile command. CMake can produce a compilation database, but it is simpler to have make print the compilation command by adding VERBOSE=1. The output isn't that big, and it gets redirected to a file, so it seems like we could leave it on by default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
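[Editorial note] As a companion to reading the VERBOSE=1 output, a sketch of one way to verify the debug flags actually took effect on a built artifact; the library path is an assumption, and readelf is assumed to be on PATH:
{code:python}
import subprocess

def has_debug_info(path):
    # 'readelf -S' lists section headers; builds with -g1/-g carry a
    # .debug_info section, while builds with -g0 do not.
    sections = subprocess.run(["readelf", "-S", path],
                              capture_output=True, text=True).stdout
    return ".debug_info" in sections

print(has_debug_info("toolchain-build/thrift-0.16.0/lib/libthrift.a"))
{code}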