[jira] [Commented] (IMPALA-13202) KRPC flags used by libkudu_client.so can't be configured

2024-07-08 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17863987#comment-17863987
 ] 

Quanlong Huang commented on IMPALA-13202:
-

Debugging in gdb, I can verify that libkudu_client.so is using its own methods and flags.

There are two variables named FLAGS_rpc_max_message_size:
{code:cpp}
(gdb) info variables FLAGS_rpc_max_message_size$
All variables matching regular expression "FLAGS_rpc_max_message_size$":

File /home/quanlong/workspace/Impala/be/src/kudu/rpc/transfer.cc:
48: google::int64 fLI64::FLAGS_rpc_max_message_size;

File /mnt/source/kudu/kudu-e742f86f6d/src/kudu/rpc/transfer.cc:
46: google::int64 fLI64::FLAGS_rpc_max_message_size;{code}
The second one comes from libkudu_client.so. The current Kudu version used in 
the Impala master branch is e742f86f6d (corresponding to the kudu-1.17 release). 
Here is where the flag is used:
{code:cpp}
102 Status InboundTransfer::ReceiveBuffer(Socket* socket, faststring* extra_4) {
...
130   if (PREDICT_FALSE(total_length_ > FLAGS_rpc_max_message_size)) {
131     return Status::NetworkError(Substitute(
132         "RPC frame had a length of $0, but we only support messages up to $1 bytes "
133         "long.", total_length_, FLAGS_rpc_max_message_size));
134   }{code}
[https://github.com/apache/kudu/blob/e742f86f6d8e687dd02d9891f33e068477163016/src/kudu/rpc/transfer.cc#L130]
Add a breakpoint in that source file where the code uses this flag.
{noformat}
(gdb) b /mnt/source/kudu/kudu-e742f86f6d/src/kudu/rpc/transfer.cc:130{noformat}
Continue in gdb and run the query in Impala. When the breakpoint is hit:
{code:cpp}
Thread 276 "rpc reactor-250" hit Breakpoint 1, 
kudu::rpc::InboundTransfer::ReceiveBuffer (this=0xd03cfc0, socket=0x14b4ed20, 
extra_4=0x7fc72c74e8e0) at 
/mnt/source/kudu/kudu-e742f86f6d/src/kudu/rpc/transfer.cc:130
130 if (PREDICT_FALSE(total_length_ > FLAGS_rpc_max_message_size)) {
(gdb) x/i $pc
=> 0x7fc7dd68bf49:   cmp    %r9,%rdx
(gdb) p $rdx
$1 = 53477464
(gdb) p $r9
$2 = 52428800{code}
The assembly code is comparing two registers, and their values match what we see 
in the error message: 52428800 (50MB) is the unmodified default value of 
FLAGS_rpc_max_message_size, and 53477464 is the RPC frame length that exceeds it.

Looking at the assembly code, register r9 is loaded from memory address 
0x7fc7ddd631d8, which is the hidden variable FLAGS_rpc_max_message_size:
{code:java}
   lea    0x6d729e(%rip),%rdx        # 0x7fc7ddd631d8 <_ZN5fLI6426FLAGS_rpc_max_message_sizeE>
   mov    (%rax),%ecx
   mov    (%rdx),%r9

   bswap  %ecx
   lea    0x4(%rcx),%edi

   mov    %edi,%edx
   mov    %edi,0x38(%rbx)

   cmp    %r9,%rdx {code}
Printing the variable shows the global one used in impalad, but printing the 
value at the address used by libkudu_client.so shows 52428800:
{noformat}
(gdb) p FLAGS_rpc_max_message_size
$25 = 2147483647
(gdb) p *((int64_t*)0x7fc7ddd631d8)
$26 = 52428800{noformat}
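
One way to double-check which binary defines that address is gdb's "info symbol" command. The output below is illustrative (not captured from this session), with the library path taken from the readelf output further down:
{noformat}
(gdb) info symbol 0x7fc7ddd631d8
_ZN5fLI6426FLAGS_rpc_max_message_sizeE in section .data of toolchain/toolchain-packages-gcc10.4.0/kudu-e742f86f6d/debug/lib/libkudu_client.so{noformat}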

> KRPC flags used by libkudu_client.so can't be configured
> 
>
> Key: IMPALA-13202
> URL: https://issues.apache.org/jira/browse/IMPALA-13202
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Quanlong Huang
>Priority: Critical
> Attachments: data.parquet
>
>
> The way Impala integrates with KRPC is by porting the KRPC code into the Impala 
> code base. Flags and methods of KRPC are defined as GLOBAL in the impalad 
> executable. libkudu_client.so is also compiled from the same KRPC code and has 
> duplicate flags and methods defined as HIDDEN.
> To be specific, both the impalad executable and libkudu_client.so have the 
> symbol for kudu::rpc::InboundTransfer::ReceiveBuffer():
> {noformat}
> $ readelf -s --wide be/build/latest/service/impalad | grep ReceiveBuffer
>  8: 022f5c88  1936 FUNC    GLOBAL DEFAULT   13 
> _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE
>  81380: 022f5c88  1936 FUNC    GLOBAL DEFAULT   13 
> _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE
> $ readelf -s --wide 
> toolchain/toolchain-packages-gcc10.4.0/kudu-e742f86f6d/debug/lib/libkudu_client.so
>  | grep ReceiveBuffer
>   1601: 00086e4a   108 FUNC    LOCAL  DEFAULT   12 
> _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE.cold
>  11905: 001fec60  2076 FUNC    LOCAL  HIDDEN   12 
> _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE
> $ c++filt 
> _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE
> kudu::rpc::InboundTransfer::ReceiveBuffer(kudu::Socket*, kudu::faststring*) 
> {noformat}

[jira] [Commented] (IMPALA-13202) KRPC flags used by libkudu_client.so can't be configured

2024-07-08 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17863983#comment-17863983
 ] 

Quanlong Huang commented on IMPALA-13202:
-

YoungJi Nam figured out a way to reproduce the issue by using the attached 
[^data.parquet]. We need the following Kudu configs:
{code:java}
--unlock_unsafe_flags=true
--max_cell_size_bytes=1073741824
--max_cfile_block_size=1073741824{code}
In the Impala dev env, add them to 
testdata/cluster/cdh7/node-*/etc/kudu/tserver.conf.
Then start the Impala cluster with the following configs:
{code:java}
-kudu_mutation_buffer_size=56477399
-kudu_error_buffer_size=56477399
-rpc_max_message_size=2147483647{code}
In the Impala dev env, the command is:
{code:java}
bin/start-impala-cluster.py -r 
--impalad_args="-kudu_mutation_buffer_size=56477399 
-kudu_error_buffer_size=56477399 -rpc_max_message_size=2147483647"{code}
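Optionally, to verify that the impalad-side flag actually took the new value, check the coordinator's debug webserver (assuming the default debug port 25000; the /varz page lists the process's gflags):
{noformat}
curl -s http://localhost:25000/varz | grep rpc_max_message_size{noformat}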
Create a Parquet table and a Kudu table:
{code:sql}
create external table test_parquet (str string) stored as parquet;

create table test_kudu_large (
  id int,
  str string,
  primary key(id)
) stored as kudu;{code}
Put the file [^data.parquet] into the location of the test_parquet table and 
REFRESH the table. A minimal sketch of that step, assuming the default dev 
warehouse path:
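{noformat}
hdfs dfs -put data.parquet /test-warehouse/test_parquet/
impala-shell -q "refresh test_parquet"{noformat}
Then INSERT the value into the Kudu table: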
{code:sql}
insert into test_kudu_large select 1, str from test_parquet;{code}
Running a SELECT query on the Kudu table reproduces the error messages in the impalad logs:
{code:sql}
select * from test_kudu_large;{code}

> KRPC flags used by libkudu_client.so can't be configured
> 
>
> Key: IMPALA-13202
> URL: https://issues.apache.org/jira/browse/IMPALA-13202
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Quanlong Huang
>Priority: Critical
> Attachments: data.parquet
>
>
> The way Impala integrates with KRPC is by porting the KRPC code into the Impala 
> code base. Flags and methods of KRPC are defined as GLOBAL in the impalad 
> executable. libkudu_client.so is also compiled from the same KRPC code and has 
> duplicate flags and methods defined as HIDDEN.
> To be specific, both the impalad executable and libkudu_client.so have the 
> symbol for kudu::rpc::InboundTransfer::ReceiveBuffer():
> {noformat}
> $ readelf -s --wide be/build/latest/service/impalad | grep ReceiveBuffer
>  8: 022f5c88  1936 FUNC    GLOBAL DEFAULT   13 
> _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE
>  81380: 022f5c88  1936 FUNC    GLOBAL DEFAULT   13 
> _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE
> $ readelf -s --wide 
> toolchain/toolchain-packages-gcc10.4.0/kudu-e742f86f6d/debug/lib/libkudu_client.so
>  | grep ReceiveBuffer
>   1601: 00086e4a   108 FUNC    LOCAL  DEFAULT   12 
> _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE.cold
>  11905: 001fec60  2076 FUNC    LOCAL  HIDDEN   12 
> _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE
> $ c++filt 
> _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE
> kudu::rpc::InboundTransfer::ReceiveBuffer(kudu::Socket*, kudu::faststring*) 
> {noformat}
> KRPC flags like rpc_max_message_size are also defined in both the impalad 
> executable and libkudu_client.so:
> {noformat}
> $ readelf -s --wide be/build/latest/service/impalad | grep 
> FLAGS_rpc_max_message_size
>  14380: 06006738 8 OBJECT  GLOBAL DEFAULT   30 
> _ZN5fLI6426FLAGS_rpc_max_message_sizeE
>  80396: 06006741 1 OBJECT  GLOBAL DEFAULT   30 
> _ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE
>  81399: 06006741 1 OBJECT  GLOBAL DEFAULT   30 
> _ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE
> 117873: 06006738 8 OBJECT  GLOBAL DEFAULT   30 
> _ZN5fLI6426FLAGS_rpc_max_message_sizeE
> $ readelf -s --wide 
> toolchain/toolchain-packages-gcc10.4.0/kudu-e742f86f6d/debug/lib/libkudu_client.so
>  | grep FLAGS_rpc_max_message_size
>  11882: 008d61e1 1 OBJECT  LOCAL  HIDDEN   27 
> _ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE
>  11906: 008d61d8 8 OBJECT  LOCAL  DEFAULT   27 
> _ZN5fLI6426FLAGS_rpc_max_message_sizeE
> $ c++filt _ZN5fLI6426FLAGS_rpc_max_message_sizeE
> fLI64::FLAGS_rpc_max_message_size {noformat}
> libkudu_client.so uses its own methods and flags. The flags are HIDDEN, so they 
> can't be modified by Impala code. E.g. IMPALA-4874 bumps 
> FLAGS_rpc_max_message_size to 2GB in RpcMgr::Init(), but the HIDDEN variable 
> FLAGS_rpc_max_message_size used in libkudu_client.so still has the default 
> value of 50MB (52428800). We've seen error messages like this in the master 
> branch:
> {code:java}
> I0708 10:23:31.784974  2943 meta_cache.cc:294] 
> c243bda4702a5ab9:0ba93d240001] tablet 0c8f3446538449ee9d3df5056afe775e: 
> replica 

[jira] [Updated] (IMPALA-13202) KRPC flags used by libkudu_client.so can't be configured

2024-07-08 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13202:

Attachment: data.parquet

> KRPC flags used by libkudu_client.so can't be configured
> 
>
> Key: IMPALA-13202
> URL: https://issues.apache.org/jira/browse/IMPALA-13202
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Quanlong Huang
>Priority: Critical
> Attachments: data.parquet
>
>
> The way Impala integrates with KRPC is by porting the KRPC code into the Impala 
> code base. Flags and methods of KRPC are defined as GLOBAL in the impalad 
> executable. libkudu_client.so is also compiled from the same KRPC code and has 
> duplicate flags and methods defined as HIDDEN.
> To be specific, both the impalad executable and libkudu_client.so have the 
> symbol for kudu::rpc::InboundTransfer::ReceiveBuffer():
> {noformat}
> $ readelf -s --wide be/build/latest/service/impalad | grep ReceiveBuffer
>  8: 022f5c88  1936 FUNC    GLOBAL DEFAULT   13 
> _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE
>  81380: 022f5c88  1936 FUNC    GLOBAL DEFAULT   13 
> _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE
> $ readelf -s --wide 
> toolchain/toolchain-packages-gcc10.4.0/kudu-e742f86f6d/debug/lib/libkudu_client.so
>  | grep ReceiveBuffer
>   1601: 00086e4a   108 FUNC    LOCAL  DEFAULT   12 
> _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE.cold
>  11905: 001fec60  2076 FUNC    LOCAL  HIDDEN   12 
> _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE
> $ c++filt 
> _ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE
> kudu::rpc::InboundTransfer::ReceiveBuffer(kudu::Socket*, kudu::faststring*) 
> {noformat}
> KRPC flags like rpc_max_message_size are also defined in both the impalad 
> executable and libkudu_client.so:
> {noformat}
> $ readelf -s --wide be/build/latest/service/impalad | grep 
> FLAGS_rpc_max_message_size
>  14380: 06006738 8 OBJECT  GLOBAL DEFAULT   30 
> _ZN5fLI6426FLAGS_rpc_max_message_sizeE
>  80396: 06006741 1 OBJECT  GLOBAL DEFAULT   30 
> _ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE
>  81399: 06006741 1 OBJECT  GLOBAL DEFAULT   30 
> _ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE
> 117873: 06006738 8 OBJECT  GLOBAL DEFAULT   30 
> _ZN5fLI6426FLAGS_rpc_max_message_sizeE
> $ readelf -s --wide 
> toolchain/toolchain-packages-gcc10.4.0/kudu-e742f86f6d/debug/lib/libkudu_client.so
>  | grep FLAGS_rpc_max_message_size
>  11882: 008d61e1 1 OBJECT  LOCAL  HIDDEN   27 
> _ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE
>  11906: 008d61d8 8 OBJECT  LOCAL  DEFAULT   27 
> _ZN5fLI6426FLAGS_rpc_max_message_sizeE
> $ c++filt _ZN5fLI6426FLAGS_rpc_max_message_sizeE
> fLI64::FLAGS_rpc_max_message_size {noformat}
> libkudu_client.so uses its own methods and flags. The flags are HIDDEN, so they 
> can't be modified by Impala code. E.g. IMPALA-4874 bumps 
> FLAGS_rpc_max_message_size to 2GB in RpcMgr::Init(), but the HIDDEN variable 
> FLAGS_rpc_max_message_size used in libkudu_client.so still has the default 
> value of 50MB (52428800). We've seen error messages like this in the master 
> branch:
> {code:java}
> I0708 10:23:31.784974  2943 meta_cache.cc:294] 
> c243bda4702a5ab9:0ba93d240001] tablet 0c8f3446538449ee9d3df5056afe775e: 
> replica e0e1db54dab74f208e37ea1b975595e5 (127.0.0.1:31202) has failed: 
> Network error: TS failed: RPC frame had a length of 53477464, but we only 
> support messages up to 52428800 bytes long.{code}
> CC [~joemcdonnell] [~wzhou] [~aserbin] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Created] (IMPALA-13202) KRPC flags used by libkudu_client.so can't be configured

2024-07-08 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13202:
---

 Summary: KRPC flags used by libkudu_client.so can't be configured
 Key: IMPALA-13202
 URL: https://issues.apache.org/jira/browse/IMPALA-13202
 Project: IMPALA
  Issue Type: Bug
  Components: Backend
Reporter: Quanlong Huang


The way Impala integrates with KRPC is by porting the KRPC code into the Impala 
code base. Flags and methods of KRPC are defined as GLOBAL in the impalad 
executable. libkudu_client.so is also compiled from the same KRPC code and has 
duplicate flags and methods defined as HIDDEN.

To be specific, both the impalad executable and libkudu_client.so have the 
symbol for kudu::rpc::InboundTransfer::ReceiveBuffer():
{noformat}
$ readelf -s --wide be/build/latest/service/impalad | grep ReceiveBuffer
 8: 022f5c88  1936 FUNC    GLOBAL DEFAULT   13 
_ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE
 81380: 022f5c88  1936 FUNC    GLOBAL DEFAULT   13 
_ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE

$ readelf -s --wide 
toolchain/toolchain-packages-gcc10.4.0/kudu-e742f86f6d/debug/lib/libkudu_client.so
 | grep ReceiveBuffer
  1601: 00086e4a   108 FUNC    LOCAL  DEFAULT   12 
_ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE.cold
 11905: 001fec60  2076 FUNC    LOCAL  HIDDEN   12 
_ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE

$ c++filt 
_ZN4kudu3rpc15InboundTransfer13ReceiveBufferEPNS_6SocketEPNS_10faststringE
kudu::rpc::InboundTransfer::ReceiveBuffer(kudu::Socket*, kudu::faststring*) 
{noformat}
KRPC flags like rpc_max_message_size are also defined in both the impalad 
executable and libkudu_client.so:
{noformat}
$ readelf -s --wide be/build/latest/service/impalad | grep 
FLAGS_rpc_max_message_size
 14380: 06006738 8 OBJECT  GLOBAL DEFAULT   30 
_ZN5fLI6426FLAGS_rpc_max_message_sizeE
 80396: 06006741 1 OBJECT  GLOBAL DEFAULT   30 
_ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE
 81399: 06006741 1 OBJECT  GLOBAL DEFAULT   30 
_ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE
117873: 06006738 8 OBJECT  GLOBAL DEFAULT   30 
_ZN5fLI6426FLAGS_rpc_max_message_sizeE

$ readelf -s --wide 
toolchain/toolchain-packages-gcc10.4.0/kudu-e742f86f6d/debug/lib/libkudu_client.so
 | grep FLAGS_rpc_max_message_size
 11882: 008d61e1 1 OBJECT  LOCAL  HIDDEN   27 
_ZN3fLB44FLAGS_rpc_max_message_size_enable_validationE
 11906: 008d61d8 8 OBJECT  LOCAL  DEFAULT   27 
_ZN5fLI6426FLAGS_rpc_max_message_sizeE

$ c++filt _ZN5fLI6426FLAGS_rpc_max_message_sizeE
fLI64::FLAGS_rpc_max_message_size {noformat}
libkudu_client.so uses its own methods and flags. The flags are HIDDEN, so they 
can't be modified by Impala code. E.g. IMPALA-4874 bumps FLAGS_rpc_max_message_size 
to 2GB in RpcMgr::Init(), but the HIDDEN variable FLAGS_rpc_max_message_size 
used in libkudu_client.so still has the default value of 50MB (52428800). We've 
seen error messages like this in the master branch:
{code:java}
I0708 10:23:31.784974  2943 meta_cache.cc:294] 
c243bda4702a5ab9:0ba93d240001] tablet 0c8f3446538449ee9d3df5056afe775e: 
replica e0e1db54dab74f208e37ea1b975595e5 (127.0.0.1:31202) has failed: Network 
error: TS failed: RPC frame had a length of 53477464, but we only support 
messages up to 52428800 bytes long.{code}
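For illustration, here is a minimal self-contained sketch of the gflags definition that both link units compile, and of why assigning to the flag in one unit does not affect the other. The flag name and 50MB default match the real flag; the help text and the standalone main() are assumptions, not the actual Kudu source:
{code:cpp}
#include <gflags/gflags.h>

// Both impalad and libkudu_client.so compile kudu/rpc/transfer.cc, so each
// link unit ends up with its own fLI64::FLAGS_rpc_max_message_size variable.
DEFINE_int64(rpc_max_message_size, 50 * 1024 * 1024,
             "Maximum size of an inbound RPC message (sketch).");

int main(int argc, char** argv) {
  gflags::ParseCommandLineFlags(&argc, &argv, true);
  // An assignment like the one IMPALA-4874 does in RpcMgr::Init() rebinds only
  // the copy visible to the current link unit. A HIDDEN copy inside a shared
  // library is resolved locally within that library and never sees this value.
  FLAGS_rpc_max_message_size = 2147483647;
  return 0;
}{code}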
CC [~joemcdonnell] [~wzhou] [~aserbin] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IMPALA-13200) Auto refresh on S3 tables based on S3 notification

2024-07-08 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13200:
---

 Summary: Auto refresh on S3 tables based on S3 notification
 Key: IMPALA-13200
 URL: https://issues.apache.org/jira/browse/IMPALA-13200
 Project: IMPALA
  Issue Type: New Feature
  Components: Catalog
Reporter: Quanlong Huang


S3 Event Notifications can be used to get updates on new files or file 
deletions:
[https://docs.aws.amazon.com/AmazonS3/latest/userguide/EventNotifications.html]

Snowflake uses it to auto refresh external tables:
https://docs.snowflake.com/en/user-guide/tables-external-s3 

Other object stores like Google Cloud Storage and Azure Blob Storage also 
have similar notification mechanisms:
https://docs.snowflake.com/en/user-guide/tables-external-gcs
https://docs.snowflake.com/en/user-guide/tables-external-azure

CC [~mylogi...@gmail.com] [~hemanth619] [~VenuReddy] [~ngangam] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IMPALA-13170) InconsistentMetadataFetchException due to database dropped when showing databases

2024-07-03 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13170:

Priority: Critical  (was: Major)

> InconsistentMetadataFetchException due to database dropped when showing 
> databases
> -
>
> Key: IMPALA-13170
> URL: https://issues.apache.org/jira/browse/IMPALA-13170
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Affects Versions: Impala 3.4.0
>Reporter: Yida Wu
>Assignee: Quanlong Huang
>Priority: Critical
> Fix For: Impala 4.5.0
>
>
> Using impalad 3.4.0, an InconsistentMetadataFetchException occurs when 
> running "show databases" in Impala while simultaneously executing "drop 
> database" to drop the newly created database in Hive.
> Steps:
> 1. Create a database (Hive)
> 2. Create tables (Hive)
> 3. Drop tables (Hive)
> 4. Run "show databases" (Impala) and drop the database (Hive) concurrently
> Logs in Impalad:
> {code:java}
> I0610 02:18:32.435815 278475 CatalogdMetaProvider.java:1354] 1:2] 
> Invalidated objects in cache: [list of database names, HMS_METADATA for DB 
> test_hive]
> I0610 02:18:32.436224 278475 jni-util.cc:288] 1:2] 
> org.apache.impala.catalog.local.InconsistentMetadataFetchException: Fetching 
> DATABASE failed. Could not find TCatalogObject(type:DATABASE, 
> catalog_version:0, db:TDatabase(db_name:test_hive))   
>   
>   
> 
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.sendRequest(CatalogdMetaProvider.java:424)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.access$100(CatalogdMetaProvider.java:185)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider$2.call(CatalogdMetaProvider.java:643)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider$2.call(CatalogdMetaProvider.java:638)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.loadWithCaching(CatalogdMetaProvider.java:521)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.loadDb(CatalogdMetaProvider.java:635)
>   at org.apache.impala.catalog.local.LocalDb.getMetaStoreDb(LocalDb.java:91) 
>   at org.apache.impala.catalog.local.LocalDb.getOwnerUser(LocalDb.java:294)
>   at org.apache.impala.service.Frontend.getDbs(Frontend.java:1066)
>   at org.apache.impala.service.JniFrontend.getDbs(JniFrontend.java:301)
> I0610 02:18:32.436257 278475 status.cc:129] 1:2] 
> InconsistentMetadataFetchException: Fetching DATABASE failed. Could not find 
> TCatalogObject(type:DATABASE, catalog_version:0, 
> {code}
> Logs in Catalog:
> {code:java}
> I0610 02:18:16.190133 222885 MetastoreEvents.java:505] EventId: 141467532 
> EventType: CREATE_DATABASE Successfully added database test_hive 
> ...
> I0610 02:18:32.276082 222885 MetastoreEvents.java:516] EventId: 141467562 
> EventType: DROP_DATABASE Creating event 141467562 of type DROP_DATABASE on 
> database test_hive
> I0610 02:18:32.277876 222885 MetastoreEvents.java:254] Total number of events 
> received: 6 Total number of events filtered out: 0
> I0610 02:18:32.277910 222885 MetastoreEvents.java:258] Incremented skipped 
> metric to 2564
> I0610 02:18:32.279537 222885 MetastoreEvents.java:505] EventId: 141467562 
> EventType: DROP_DATABASE Removed Database test_hive
> {code}
> The case is similar to IMPALA-9441. We may want to handle the error in a 
> better way in Frontend.getDbs().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (IMPALA-13170) InconsistentMetadataFetchException due to database dropped when showing databases

2024-07-03 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13170:

Priority: Major  (was: Critical)

> InconsistentMetadataFetchException due to database dropped when showing 
> databases
> -
>
> Key: IMPALA-13170
> URL: https://issues.apache.org/jira/browse/IMPALA-13170
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Affects Versions: Impala 3.4.0
>Reporter: Yida Wu
>Assignee: Quanlong Huang
>Priority: Major
> Fix For: Impala 4.5.0
>
>
> Using impalad 3.4.0, an InconsistentMetadataFetchException occurs when 
> running "show databases" in Impala while simultaneously executing "drop 
> database" to drop the newly created database in Hive.
> Steps:
> 1. Create a database (Hive)
> 2. Create tables (Hive)
> 3. Drop tables (Hive)
> 4. Run "show databases" (Impala) and drop the database (Hive) concurrently
> Logs in Impalad:
> {code:java}
> I0610 02:18:32.435815 278475 CatalogdMetaProvider.java:1354] 1:2] 
> Invalidated objects in cache: [list of database names, HMS_METADATA for DB 
> test_hive]
> I0610 02:18:32.436224 278475 jni-util.cc:288] 1:2] 
> org.apache.impala.catalog.local.InconsistentMetadataFetchException: Fetching 
> DATABASE failed. Could not find TCatalogObject(type:DATABASE, 
> catalog_version:0, db:TDatabase(db_name:test_hive))   
>   
>   
> 
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.sendRequest(CatalogdMetaProvider.java:424)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.access$100(CatalogdMetaProvider.java:185)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider$2.call(CatalogdMetaProvider.java:643)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider$2.call(CatalogdMetaProvider.java:638)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.loadWithCaching(CatalogdMetaProvider.java:521)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.loadDb(CatalogdMetaProvider.java:635)
>   at org.apache.impala.catalog.local.LocalDb.getMetaStoreDb(LocalDb.java:91) 
>   at org.apache.impala.catalog.local.LocalDb.getOwnerUser(LocalDb.java:294)
>   at org.apache.impala.service.Frontend.getDbs(Frontend.java:1066)
>   at org.apache.impala.service.JniFrontend.getDbs(JniFrontend.java:301)
> I0610 02:18:32.436257 278475 status.cc:129] 1:2] 
> InconsistentMetadataFetchException: Fetching DATABASE failed. Could not find 
> TCatalogObject(type:DATABASE, catalog_version:0, 
> {code}
> Logs in Catalog:
> {code:java}
> I0610 02:18:16.190133 222885 MetastoreEvents.java:505] EventId: 141467532 
> EventType: CREATE_DATABASE Successfully added database test_hive 
> ...
> I0610 02:18:32.276082 222885 MetastoreEvents.java:516] EventId: 141467562 
> EventType: DROP_DATABASE Creating event 141467562 of type DROP_DATABASE on 
> database test_hive
> I0610 02:18:32.277876 222885 MetastoreEvents.java:254] Total number of events 
> received: 6 Total number of events filtered out: 0
> I0610 02:18:32.277910 222885 MetastoreEvents.java:258] Incremented skipped 
> metric to 2564
> I0610 02:18:32.279537 222885 MetastoreEvents.java:505] EventId: 141467562 
> EventType: DROP_DATABASE Removed Database test_hive
> {code}
> The case is similar to IMPALA-9441. We may want to handle the error in a 
> better way in Frontend.getDbs().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (IMPALA-13192) Impala Coordinator stuck and Full GC when execute query from nested temporary table.

2024-07-02 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13192:

Priority: Critical  (was: Major)

> Impala Coordinator stuck and Full GC when execute query from nested temporary 
> table.
> 
>
> Key: IMPALA-13192
> URL: https://issues.apache.org/jira/browse/IMPALA-13192
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
> Environment: impalad version 4.3.0-RELEASE RELEASE (build 
> 14bb13e67e48742df72f9e1dd73be15ec7ba31bd)
>Reporter: LiuYuan
>Priority: Critical
>
> 1. Create a table as below:
>  
> {code:java}
> CREATE TABLE trunck_info (    
>   user_id BIGINT ,    
>   truck_length DOUBLE,    
>   length_type STRING,    
>   point_km DOUBLE,    
>   estimate_mileage DOUBLE,    
>   dep_rate DOUBLE,    
>   line_day_cnt_01 BIGINT,    
>   line_ly_cnt_01 BIGINT,    
>   line_day_cnt_30 BIGINT,    
>   line_ly_cnt_30 BIGINT,    
>   line_day_cnt_60 BIGINT,    
>   line_ly_cnt_60 BIGINT,    
>   num_all_60 BIGINT,    
>   num_est_60 BIGINT,    
>   num_est_order_60 BIGINT,    
>   num_act_60 BIGINT,    
>   num_inh_60 BIGINT,    
>   num_all_30 BIGINT,    
>   num_est_30 BIGINT,    
>   num_est_order_30 BIGINT,    
>   num_act_30 BIGINT,    
>   num_inh_30 BIGINT,    
>   conn_num_60 BIGINT,    
>   conn_num_30 BIGINT,    
>   hp_num_60 INT,    
>   hp_num_30 INT,    
>   bzj_num INT,    
>   feidan8_num_60 BIGINT,    
>   feidan8_num_30 INT,    
>   ts_num_60 BIGINT,    
>   ts_num_30 INT,    
>   new_mile_point_60 BIGINT,    
>   new_mile_point_30 BIGINT    
> )    
> WITH SERDEPROPERTIES ('serialization.format'='1')
> STORED AS TEXTFILE {code}
>  
> 2. Query from the nested temporary table; the coordinator hangs and goes into full GC.
>  
>  
>  
> {panel:title=hung.sql}
> with t1
> as
> (
> select  user_id
>    ,nvl(num_inh_60,0)+nvl(conn_num_60,0)+nvl(new_mile_point_60,0) as 
> score_all   
>    ,                  nvl(conn_num_60,0)+nvl(new_mile_point_60,0) as 
> score_noinh 
>   from trunck_info
> )
> ,t2
> as
> (
> select  user_id
>        ,score_noinh + score_inh as score_all
>        ,score_noinh
>   from
>   (
> select  user_id
>        ,score_noinh
>        ,case when score_all >= 800 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>              when score_all >= 600 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>              when score_all >= 450 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>              when score_all >= 300 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>              when score_all >    0 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>     end as score_inh 
>   from t1
>  where score_noinh > 0
>   ) a
> )
> ,t3
> as
> (
> select  user_id
>        ,score_noinh + score_inh as score_all
>        ,score_noinh
>   from
>   (
> select  user_id
>        ,score_noinh
>        ,case when score_all >= 800 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>              when score_all >= 600 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>              when score_all >= 450 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>              when score_all >= 300 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>              when score_all >    0 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>     end as score_inh 
>   from t2
>  where score_noinh > 0
>   ) a
> )
> ,t4
> as
> (
> select  user_id
>        ,score_noinh + score_inh as score_all
>        ,score_noinh
>   from
>   (
> select  user_id
>        ,score_noinh
>        ,case when score_all >= 800 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>              when score_all >= 600 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>              when score_all >= 450 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>              when score_all >= 300 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>              when score_all >    0 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>     end as score_inh 
>   from t3
>  where score_noinh > 0
>   ) a
> )
> ,t5
> as
> (
> select  user_id
>        ,score_noinh + score_inh as score_all
>        ,score_noinh
>   from
>   (
> select  user_id
>        ,score_noinh
>        ,case when score_all >= 800 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>              when score_all >= 600 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>              when score_all >= 450 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>              when score_all >= 300 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>              when score_all >    0 then if(score_all*0.5 >= 
> 450,450,score_all*0.5)
>     end as score_inh 
>   from t4
>  where score_noinh > 0
>   ) a
> )
> ,t6
> as
> (
> select  user_id
>        ,score_noinh + 

[jira] [Updated] (IMPALA-13193) RuntimeFilter on parquet dictionary should evaluate null values

2024-07-02 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13193:

Affects Version/s: Impala 4.4.0
   Impala 4.3.0
   Impala 4.1.2
   Impala 4.1.1
   Impala 4.2.0
   Impala 4.1.0

> RuntimeFilter on parquet dictionary should evaluate null values
> ---
>
> Key: IMPALA-13193
> URL: https://issues.apache.org/jira/browse/IMPALA-13193
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 4.1.0, Impala 4.2.0, Impala 4.1.1, Impala 4.1.2, 
> Impala 4.3.0, Impala 4.4.0
>Reporter: Quanlong Huang
>Priority: Critical
>
> IMPALA-10910 and IMPALA-5509 introduce an optimization to evaluate runtime 
> filters on parquet dictionary values. If none of the values can pass the check, 
> the whole row group will be skipped. However, NULL values are not included in 
> the parquet dictionary. Runtime filters that accept NULL values might 
> incorrectly reject the row group if none of the dictionary values can pass 
> the check.
> Here are steps to reproduce the bug:
> {code:sql}
> create table parq_tbl (id bigint, name string) stored as parquet;
> insert into parq_tbl values (0, "abc"), (1, NULL), (2, NULL), (3, "abc");
> create table dim_tbl (name string);
> insert into dim_tbl values (NULL);
> select * from parq_tbl p join dim_tbl d
>   on COALESCE(p.name, '') = COALESCE(d.name, '');{code}
> The SELECT query should return 2 rows but now it returns 0 rows.
> A workaround is to disable this optimization:
> {code:sql}
> set PARQUET_DICTIONARY_RUNTIME_FILTER_ENTRY_LIMIT=0;{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Created] (IMPALA-13193) RuntimeFilter on parquet dictionary should evaluate null values

2024-07-02 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13193:
---

 Summary: RuntimeFilter on parquet dictionary should evaluate null 
values
 Key: IMPALA-13193
 URL: https://issues.apache.org/jira/browse/IMPALA-13193
 Project: IMPALA
  Issue Type: Bug
  Components: Backend
Reporter: Quanlong Huang


IMPALA-10910 and IMPALA-5509 introduce an optimization to evaluate runtime filters 
on parquet dictionary values. If none of the values can pass the check, the 
whole row group will be skipped. However, NULL values are not included in the 
parquet dictionary. Runtime filters that accept NULL values might incorrectly 
reject the row group if none of the dictionary values can pass the check.
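In rough, self-contained C++ (hypothetical types and names, not Impala's actual code), the flawed pruning logic looks like this:
{code:cpp}
#include <optional>
#include <vector>

// Hypothetical stand-ins for Impala's real classes.
using Value = std::optional<int>;  // std::nullopt models SQL NULL
struct RuntimeFilter {
  bool values_pass = false;  // whether any non-NULL dictionary value passes
  bool accepts_null = true;  // e.g. a filter built from COALESCE(name, '')
  bool Eval(const Value& v) const {
    return v.has_value() ? values_pass : accepts_null;
  }
};

// The row group is skipped when no dictionary entry passes the filter. The
// bug: NULLs never appear in the dictionary, so a filter that accepts NULL
// can still reject a row group whose NULL rows would match.
bool CanSkipRowGroup(const std::vector<Value>& dict, const RuntimeFilter& f,
                     bool column_has_nulls) {
  for (const Value& v : dict) {
    if (f.Eval(v)) return false;
  }
  // Missing check that would fix it:
  // if (column_has_nulls && f.Eval(std::nullopt)) return false;
  return true;
}{code}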

Here are steps to reproduce the bug:
{code:sql}
create table parq_tbl (id bigint, name string) stored as parquet;
insert into parq_tbl values (0, "abc"), (1, NULL), (2, NULL), (3, "abc");

create table dim_tbl (name string);
insert into dim_tbl values (NULL);

select * from parq_tbl p join dim_tbl d
  on COALESCE(p.name, '') = COALESCE(d.name, '');{code}
The SELECT query should return 2 rows but now it returns 0 rows.

A workaround is to disable this optimization:
{code:sql}
set PARQUET_DICTIONARY_RUNTIME_FILTER_ENTRY_LIMIT=0;{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (IMPALA-5509) Runtime filter : Extend runtime filter to support Dictionary values

2024-07-02 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-5509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-5509:
---
Fix Version/s: Impala 4.1.0

> Runtime filter : Extend runtime filter to support Dictionary values
> ---
>
> Key: IMPALA-5509
> URL: https://issues.apache.org/jira/browse/IMPALA-5509
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 2.9.0
>Reporter: Alan Choi
>Assignee: Csaba Ringhofer
>Priority: Major
>  Labels: performance, runtime-filters
> Fix For: Impala 4.1.0
>
>
> For runtime filter on a single column, it can be run against the dictionary 
> values in Parquet to enable efficient block filtering.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Resolved] (IMPALA-13170) InconsistentMetadataFetchException due to database dropped when showing databases

2024-07-01 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang resolved IMPALA-13170.
-
Fix Version/s: Impala 4.5.0
   Resolution: Fixed

> InconsistentMetadataFetchException due to database dropped when showing 
> databases
> -
>
> Key: IMPALA-13170
> URL: https://issues.apache.org/jira/browse/IMPALA-13170
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Affects Versions: Impala 3.4.0
>Reporter: Yida Wu
>Assignee: Quanlong Huang
>Priority: Major
> Fix For: Impala 4.5.0
>
>
> Using impalad 3.4.0, an InconsistentMetadataFetchException occurs when 
> running "show databases" in Impala while simultaneously executing "drop 
> database" to drop the newly created database in Hive.
> Steps:
> 1. Create a database (Hive)
> 2. Create tables (Hive)
> 3. Drop tables (Hive)
> 4. Run "show databases" (Impala) and drop the database (Hive) concurrently
> Logs in Impalad:
> {code:java}
> I0610 02:18:32.435815 278475 CatalogdMetaProvider.java:1354] 1:2] 
> Invalidated objects in cache: [list of database names, HMS_METADATA for DB 
> test_hive]
> I0610 02:18:32.436224 278475 jni-util.cc:288] 1:2] 
> org.apache.impala.catalog.local.InconsistentMetadataFetchException: Fetching 
> DATABASE failed. Could not find TCatalogObject(type:DATABASE, 
> catalog_version:0, db:TDatabase(db_name:test_hive))   
>   
>   
> 
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.sendRequest(CatalogdMetaProvider.java:424)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.access$100(CatalogdMetaProvider.java:185)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider$2.call(CatalogdMetaProvider.java:643)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider$2.call(CatalogdMetaProvider.java:638)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.loadWithCaching(CatalogdMetaProvider.java:521)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.loadDb(CatalogdMetaProvider.java:635)
>   at org.apache.impala.catalog.local.LocalDb.getMetaStoreDb(LocalDb.java:91) 
>   at org.apache.impala.catalog.local.LocalDb.getOwnerUser(LocalDb.java:294)
>   at org.apache.impala.service.Frontend.getDbs(Frontend.java:1066)
>   at org.apache.impala.service.JniFrontend.getDbs(JniFrontend.java:301)
> I0610 02:18:32.436257 278475 status.cc:129] 1:2] 
> InconsistentMetadataFetchException: Fetching DATABASE failed. Could not find 
> TCatalogObject(type:DATABASE, catalog_version:0, 
> {code}
> Logs in Catalog:
> {code:java}
> I0610 02:18:16.190133 222885 MetastoreEvents.java:505] EventId: 141467532 
> EventType: CREATE_DATABASE Successfully added database test_hive 
> ...
> I0610 02:18:32.276082 222885 MetastoreEvents.java:516] EventId: 141467562 
> EventType: DROP_DATABASE Creating event 141467562 of type DROP_DATABASE on 
> database test_hive
> I0610 02:18:32.277876 222885 MetastoreEvents.java:254] Total number of events 
> received: 6 Total number of events filtered out: 0
> I0610 02:18:32.277910 222885 MetastoreEvents.java:258] Incremented skipped 
> metric to 2564
> I0610 02:18:32.279537 222885 MetastoreEvents.java:505] EventId: 141467562 
> EventType: DROP_DATABASE Removed Database test_hive
> {code}
> The case is similar to IMPALA-9441. We may want to handle the error in a 
> better way in Frontend.getDbs().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Resolved] (IMPALA-9441) TestHS2.test_get_schemas is flaky in local catalog mode

2024-07-01 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang resolved IMPALA-9441.

Fix Version/s: Impala 4.5.0
   Resolution: Fixed

> TestHS2.test_get_schemas is flaky in local catalog mode
> ---
>
> Key: IMPALA-9441
> URL: https://issues.apache.org/jira/browse/IMPALA-9441
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Sahil Takiar
>Assignee: Quanlong Huang
>Priority: Critical
> Fix For: Impala 4.5.0
>
>
> Saw this once on a ubuntu-16.04-dockerised-tests job:
> {code:java}
> Error Message
> hs2/hs2_test_suite.py:63: in add_session lambda: fn(self)) 
> hs2/hs2_test_suite.py:44: in add_session_helper fn() 
> hs2/hs2_test_suite.py:63: in <lambda> lambda: fn(self)) 
> hs2/test_hs2.py:423: in test_get_schemas 
> TestHS2.check_response(get_schemas_resp) hs2/hs2_test_suite.py:131: in 
> check_response assert response.status.statusCode == expected_status_code 
> E   assert 3 == 0 E+  where 3 = 3 E+where 3 = 
> TStatus(errorCode=None, errorMessage="DatabaseNotFoundException: Database 
> 'test_compute_stats_impala_2201_e794b8f' not found\n", sqlState='HY000', 
> infoMessages=None, statusCode=3).statusCode E+  where 
> TStatus(errorCode=None, errorMessage="DatabaseNotFoundException: Database 
> 'test_compute_stats_impala_2201_e794b8f' not found\n", sqlState='HY000', 
> infoMessages=None, statusCode=3) = TStatus(errorCode=None, 
> errorMessage="DatabaseNotFoundException: Database 
> 'test_compute_stats_impala_2201_e794b8f' not found\n", sqlState='HY000', 
> infoMessages=None, statusCode=3) E+where TStatus(errorCode=None, 
> errorMessage="DatabaseNotFoundException: Database 
> 'test_compute_stats_impala_2201_e794b8f' not found\n", sqlState='HY000', 
> infoMessages=None, statusCode=3) = 
> TGetSchemasResp(status=TStatus(errorCode=None, 
> errorMessage="DatabaseNotFoundException: Database 
> 'test_compute_stats_i...nHandle(hasResultSet=False, modifiedRowCount=None, 
> operationType=3, operationId=THandleIdentifier(secret='', guid=''))).status
> Stacktrace
> hs2/hs2_test_suite.py:63: in add_session
> lambda: fn(self))
> hs2/hs2_test_suite.py:44: in add_session_helper
> fn()
> hs2/hs2_test_suite.py:63: in <lambda>
> lambda: fn(self))
> hs2/test_hs2.py:423: in test_get_schemas
> TestHS2.check_response(get_schemas_resp)
> hs2/hs2_test_suite.py:131: in check_response
> assert response.status.statusCode == expected_status_code
> E   assert 3 == 0
> E+  where 3 = 3
> E+where 3 = TStatus(errorCode=None, 
> errorMessage="DatabaseNotFoundException: Database 
> 'test_compute_stats_impala_2201_e794b8f' not found\n", sqlState='HY000', 
> infoMessages=None, statusCode=3).statusCode
> E+  where TStatus(errorCode=None, 
> errorMessage="DatabaseNotFoundException: Database 
> 'test_compute_stats_impala_2201_e794b8f' not found\n", sqlState='HY000', 
> infoMessages=None, statusCode=3) = TStatus(errorCode=None, 
> errorMessage="DatabaseNotFoundException: Database 
> 'test_compute_stats_impala_2201_e794b8f' not found\n", sqlState='HY000', 
> infoMessages=None, statusCode=3)
> E+where TStatus(errorCode=None, 
> errorMessage="DatabaseNotFoundException: Database 
> 'test_compute_stats_impala_2201_e794b8f' not found\n", sqlState='HY000', 
> infoMessages=None, statusCode=3) = 
> TGetSchemasResp(status=TStatus(errorCode=None, 
> errorMessage="DatabaseNotFoundException: Database 
> 'test_compute_stats_i...nHandle(hasResultSet=False, modifiedRowCount=None, 
> operationType=3, operationId=THandleIdentifier(secret='', guid=''))).status 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IMPALA-13161) impalad crash -- impala::DelimitedTextParser::ParseFieldLocations

2024-06-30 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17860997#comment-17860997
 ] 

Quanlong Huang commented on IMPALA-13161:
-

Uploaded a fix for review: https://gerrit.cloudera.org/c/21559/

> impalad crash -- impala::DelimitedTextParser::ParseFieldLocations
> ---
>
> Key: IMPALA-13161
> URL: https://issues.apache.org/jira/browse/IMPALA-13161
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 4.0.0, Impala 4.4.0
>Reporter: nyq
>Assignee: Quanlong Huang
>Priority: Critical
>
> Impala version: 4.0.0
> Problem:
> impalad crashes when operating on a text table, which has a 3GB data file that 
> only contains '\x00' characters.
> Steps:
> python -c 'f=open("impala_0_3gb.data.csv", "wb");tmp="\x00"*1024*1024*3; 
> [f.write(tmp) for i in range(1024)] ;f.close()'
> create table impala_0_3gb (id int)
> hdfs dfs -put impala_0_3gb.data.csv /user/hive/warehouse/impala_0_3gb/
> refresh impala_0_3gb
> select count(1) from impala_0_3gb
> Errors:
> Wrote minidump to 1dcf110f-5a2e-49a2-be4eb7a5-4709ed19.dmp
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x0181861c, pid=956182, tid=0x7fc6b340e700
> #
> # JRE version: OpenJDK Runtime Environment (8.0) (build 1.8.0)
> # Java VM: OpenJDK 64-Bit Server VM
> # Problematic frame:
> # C  [impalad+0x141861c]  
> impala::DelimitedTextParser::ParseFieldLocations(int, long, char**, 
> char**, impala::FieldLocation*, int*, int*, char**)+0x7cc
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /tmp/hs_err_pid956182.log
> #
> #
> C  [impalad+0x141861c]  
> impala::DelimitedTextParser::ParseFieldLocations(int, long, char**, 
> char**, impala::FieldLocation*, int*, int*, char**)+0x7cc
> C  [impalad+0x136fe11]  
> impala::HdfsTextScanner::ProcessRange(impala::RowBatch*, int*)+0x1a1
> C  [impalad+0x137100e]  
> impala::HdfsTextScanner::FinishScanRange(impala::RowBatch*)+0x3be
> C  [impalad+0x13721ac]  
> impala::HdfsTextScanner::GetNextInternal(impala::RowBatch*)+0x12c
> C  [impalad+0x131cdfc]  impala::HdfsScanner::ProcessSplit()+0x19c
> C  [impalad+0x1443e17]  
> impala::HdfsScanNode::ProcessSplit(std::vector std::allocator > const&, impala::MemPool*, 
> impala::io::ScanRange*, long*)+0x7e7
> C  [impalad+0x1447001]  impala::HdfsScanNode::ScannerThread(bool, long)+0x541



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-13161) impalad crash -- impala::DelimitedTextParser::ParseFieldLocations

2024-06-28 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17860808#comment-17860808
 ] 

Quanlong Huang commented on IMPALA-13161:
-

Got the stacktrace in gdb:
{noformat}
Thread 304 "impalad" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f0abfe7a700 (LWP 10172)]
0x023d4273 in impala::DelimitedTextParser::ReturnCurrentColumn 
(this=0xde37f40) at 
/home/quanlong/workspace/Impala/be/src/exec/delimited-text-parser.h:113
113   bool ReturnCurrentColumn() const {
(gdb) bt
#0  0x023d4273 in 
impala::DelimitedTextParser::ReturnCurrentColumn (this=0xde37f40) at 
/home/quanlong/workspace/Impala/be/src/exec/delimited-text-parser.h:113
#1  impala::DelimitedTextParser::AddColumn (field_locations=0x0, 
num_fields=0x7f0abfe78824, next_column_start=0x7f0abfe78828, len=0, 
this=0xde37f40) at 
/home/quanlong/workspace/Impala/be/src/exec/delimited-text-parser.inline.h:62
#2  impala::DelimitedTextParser::ParseSse 
(this=this@entry=0xde37f40, max_tuples=max_tuples@entry=1, 
remaining_len=remaining_len@entry=0x7f0abfe78718, 
byte_buffer_ptr=byte_buffer_ptr@entry=0xd074f88, 
row_end_locations=row_end_locations@entry=0xcf5, field_locations=0x0, 
num_tuples=0x7f0abfe78a80, num_fields=0x7f0abfe78824, 
next_column_start=0x7f0abfe78828)
at 
/home/quanlong/workspace/Impala/be/src/exec/delimited-text-parser.inline.h:189
#3  0x023d4981 in 
impala::DelimitedTextParser::ParseFieldLocations (this=0xde37f40, 
max_tuples=max_tuples@entry=1, remaining_len=, 
byte_buffer_ptr=byte_buffer_ptr@entry=0xd074f88, row_end_locations=0xcf5, 
field_locations=0x0, num_tuples=0x7f0abfe78a80, num_fields=0x7f0abfe78824, 
next_column_start=0x7f0abfe78828) at 
/home/quanlong/workspace/Impala/be/src/common/status.h:105
#4  0x02057247 in impala::HdfsTextScanner::ProcessRange 
(this=this@entry=0xd074dc0, row_batch=row_batch@entry=0x1618f760, 
num_tuples=num_tuples@entry=0x7f0abfe78a80)
at 
/home/quanlong/workspace/Impala/toolchain/toolchain-packages-gcc10.4.0/gcc-10.4.0/include/c++/10.4.0/bits/stl_vector.h:1168
#5  0x0205961f in impala::HdfsTextScanner::FinishScanRange 
(this=this@entry=0xd074dc0, row_batch=row_batch@entry=0x1618f760) at 
/home/quanlong/workspace/Impala/be/src/exec/text/hdfs-text-scanner.cc:361
#6  0x02059d6d in impala::HdfsTextScanner::GetNextInternal 
(this=0xd074dc0, row_batch=0x1618f760) at 
/home/quanlong/workspace/Impala/be/src/exec/text/hdfs-text-scanner.cc:491
#7  0x01b34223 in impala::HdfsScanner::ProcessSplit (this=0xd074dc0) at 
/home/quanlong/workspace/Impala/toolchain/toolchain-packages-gcc10.4.0/gcc-10.4.0/include/c++/10.4.0/bits/unique_ptr.h:421
... {noformat}
It crashed in ReturnCurrentColumn(), whose body is a single line:
{code:cpp}
110   /// Will we return the current column to the query?
111   /// Hive allows cols at the end of the table that are not in the schema.  We'll
112   /// just ignore those columns
113   bool ReturnCurrentColumn() const {
114     return column_idx_ < num_cols_ && is_materialized_col_[column_idx_];
115   }
{code}
The type of column_idx_ is int, but it overflows and becomes negative 
(-2147483648):
{noformat}
(gdb) p *this
$1 = {xmm_tuple_search_ = {3338, 0}, xmm_delim_search_ = {3338, 0}, 
xmm_escape_search_ = {5216405793391866985, 5651570509107196769}, 
is_materialized_col_ = 0x798b740, num_tuple_delims_ = 2, num_delims_ = 3, 
num_cols_ = 1, 
  num_partition_keys_ = 0, column_idx_ = -2147483648, last_row_delim_offset_ = 
-1, low_mask_ = {0 }, high_mask_ = {0 }, 
field_delim_ = 0 '\000', process_escapes_ = false, escape_char_ = 0 '\000', 
  collection_item_delim_ = 0 '\000', tuple_delim_ = 10 '\n', 
current_column_has_escape_ = false, last_char_is_escape_ = false, 
unfinished_tuple_ = true}
(gdb) p/x column_idx_
$2 = 0x8000{noformat}
I think the overflow happens here:
https://github.com/apache/impala/blob/333902afcccb8a45c25ae558cc67ceb719bccbfc/be/src/exec/delimited-text-parser.inline.h#L74

\x00 is considered the default field delimiter here, so the number of columns 
overflows the int type.
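
As a standalone illustration of the wrap-around (a minimal sketch in plain 
Java, not the actual C++ parser code):
{code:java}
// A 3GB file of '\x00' bytes with '\x00' as the field delimiter yields
// ~3.2 billion fields, more than Integer.MAX_VALUE (2147483647).
public class ColumnIdxOverflow {
  public static void main(String[] args) {
    int columnIdx = Integer.MAX_VALUE;  // counter after 2^31 - 1 fields
    columnIdx++;                        // one more field arrives...
    System.out.println(columnIdx);      // prints -2147483648
    // In ReturnCurrentColumn(), -2147483648 < num_cols_ is still true, so
    // is_materialized_col_[column_idx_] is read with a negative index --
    // in the C++ code that is the out-of-bounds access behind the SIGSEGV.
  }
}
{code}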

> impalad crash -- impala::DelimitedTextParser::ParseFieldLocations
> ---
>
> Key: IMPALA-13161
> URL: https://issues.apache.org/jira/browse/IMPALA-13161
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 4.0.0, Impala 4.4.0
>Reporter: nyq
>Assignee: Quanlong Huang
>Priority: 

[jira] [Assigned] (IMPALA-13161) impalad crash -- impala::DelimitedTextParser::ParseFieldLocations

2024-06-28 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang reassigned IMPALA-13161:
---

Assignee: Quanlong Huang

> impalad crash -- impala::DelimitedTextParser::ParseFieldLocations
> ---
>
> Key: IMPALA-13161
> URL: https://issues.apache.org/jira/browse/IMPALA-13161
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 4.0.0, Impala 4.4.0
>Reporter: nyq
>Assignee: Quanlong Huang
>Priority: Critical
>
> Impala version: 4.0.0
> Problem:
> impalad crashes when querying a text table whose 3GB data file contains only 
> '\x00' characters
> Steps:
> python -c 'f=open("impala_0_3gb.data.csv", "wb");tmp="\x00"*1024*1024*3; 
> [f.write(tmp) for i in range(1024)] ;f.close()'
> create table impala_0_3gb (id int)
> hdfs dfs -put impala_0_3gb.data.csv /user/hive/warehouse/impala_0_3gb/
> refresh impala_0_3gb
> select count(1) from impala_0_3gb
> Errors:
> Wrote minidump to 1dcf110f-5a2e-49a2-be4eb7a5-4709ed19.dmp
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x0181861c, pid=956182, tid=0x7fc6b340e700
> #
> # JRE version: OpenJDK Runtime Environment (8.0) (build 1.8.0)
> # Java VM: OpenJDK 64-Bit Server VM
> # Problematic frame:
> # C  [impalad+0x141861c]  
> impala::DelimitedTextParser::ParseFieldLocations(int, long, char**, 
> char**, impala::FieldLocation*, int*, int*, char**)+0x7cc
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /tmp/hs_err_pid956182.log
> #
> #
> C  [impalad+0x141861c]  
> impala::DelimitedTextParser::ParseFieldLocations(int, long, char**, 
> char**, impala::FieldLocation*, int*, int*, char**)+0x7cc
> C  [impalad+0x136fe11]  
> impala::HdfsTextScanner::ProcessRange(impala::RowBatch*, int*)+0x1a1
> C  [impalad+0x137100e]  
> impala::HdfsTextScanner::FinishScanRange(impala::RowBatch*)+0x3be
> C  [impalad+0x13721ac]  
> impala::HdfsTextScanner::GetNextInternal(impala::RowBatch*)+0x12c
> C  [impalad+0x131cdfc]  impala::HdfsScanner::ProcessSplit()+0x19c
> C  [impalad+0x1443e17]  
> impala::HdfsScanNode::ProcessSplit(std::vector std::allocator > const&, impala::MemPool*, 
> impala::io::ScanRange*, long*)+0x7e7
> C  [impalad+0x1447001]  impala::HdfsScanNode::ScannerThread(bool, long)+0x541



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Assigned] (IMPALA-13120) Failed table loads are not tried to load again even though hive metastore is UP

2024-06-28 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang reassigned IMPALA-13120:
---

Assignee: Venugopal Reddy K

> Failed table loads are not tried to load again even though hive metastore is 
> UP
> ---
>
> Key: IMPALA-13120
> URL: https://issues.apache.org/jira/browse/IMPALA-13120
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Venugopal Reddy K
>Assignee: Venugopal Reddy K
>Priority: Major
>
> *Description:*
> If the metastore is down when the table load is triggered, catalogd creates a 
> new IncompleteTable instance with cause=TableLoadingException and updates the 
> catalog with a new version. On the coordinator/impalad, StmtMetadataLoader's 
> loadTables(), which has been waiting for the table load to complete, treats 
> the table as loaded (a failed load). Then, during the analyzer's table-resolve 
> step, if the table is incomplete, a TableLoadingException is thrown to the 
> user.
> Note: an IncompleteTable with a non-null cause is considered loaded.
> *Henceforth, queries on the table don't trigger the table load (at 
> StmtMetadataLoader) since the table is an IncompleteTable with a non-null 
> cause (i.e., TableLoadingException). Even after the metastore is UP again, 
> queries continue to fail with the same TableLoadingException:*
> {{CAUSED BY: TableLoadingException: Failed to load metadata for table: 
> default.t1. Running 'invalidate metadata default.t1' may resolve this 
> problem.}}
> {{CAUSED BY: MetaException: Could not connect to meta store using any of the 
> URIs provided. Most recent failure: 
> org.apache.thrift.transport.TTransportException: java.net.ConnectException: 
> Connection refused (Connection refused)}}
> *At present, an explicit invalidate metadata is the only way to recover the 
> table from this state. Queries executed after the metastore is up should 
> succeed without the need for an explicit invalidate metadata.*
> *Steps to Reproduce:*
>  # Create a table from Hive and insert some data into it.
>  # Bring down the Hive metastore process.
>  # Run a query on Impala that triggers the table load. The query fails with a 
> TableLoadingException.
>  # Bring up the Hive metastore process.
>  # Run the query on Impala again. It still fails with the same 
> TableLoadingException.
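
One possible shape for the improvement proposed above (a hedged sketch with 
hypothetical names, not the actual design):
{code:java}
// Hypothetical sketch, not Impala code: today an IncompleteTable with a
// non-null cause (a failed load) counts as loaded, so StmtMetadataLoader
// never retries. Treating a failed load as "not loaded" would let the
// next query re-trigger the load once the metastore is back up.
class IncompleteTableSketch {
  private final Throwable cause;  // non-null after a failed load

  IncompleteTableSketch(Throwable cause) { this.cause = cause; }

  boolean isLoadedToday()    { return cause != null; }  // current: failure is sticky
  boolean isLoadedProposed() { return false; }          // proposed: always retry
}
{code}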



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-12141) IllegalMonitorStateException when trying to release the table lock

2024-06-28 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17860738#comment-17860738
 ] 

Quanlong Huang commented on IMPALA-12141:
-

[~VenuReddy], [~hemanth619] Do you want to take this?

> IllegalMonitorStateException when trying to release the table lock
> --
>
> Key: IMPALA-12141
> URL: https://issues.apache.org/jira/browse/IMPALA-12141
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Quanlong Huang
>Priority: Critical
>
> We saw event-processor went into the ERROR state due to an 
> IllegalMonitorStateException:
> {noformat}
> I0504 12:28:45.272922 189771 MetastoreEvents.java:401] EventId: 56369449 
> EventType: INSERT Incremented events skipped counter to 283902
> I0504 12:28:45.272941 189771 MetastoreEvents.java:401] EventId: 56369449 
> EventType: INSERT Not processing the event as it is a self-event
> I0504 14:28:45.283041 189771 MetastoreEvents.java:412] EventId: 56369450 
> EventType: INSERT Received exception Error during self-event evaluation for 
> table xxx. due to lock contention. Ignoring self-event evaluation
> E0504 16:28:45.286149 189771 MetastoreEventsProcessor.java:684] Unexpected 
> exception received while processing event
> Java exception follows:
> java.lang.IllegalMonitorStateException
> at 
> java.base/java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryRelease(ReentrantReadWriteLock.java:372)
> at 
> java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1302)
> at 
> java.base/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.unlock(ReentrantReadWriteLock.java:1147)
> at org.apache.impala.catalog.Table.releaseWriteLock(Table.java:262)
> at 
> org.apache.impala.service.CatalogOpExecutor.reloadPartitionIfExists(CatalogOpExecutor.java:3788)
> at 
> org.apache.impala.catalog.events.MetastoreEvents$MetastoreTableEvent.reloadPartition(MetastoreEvents.java:633)
> at 
> org.apache.impala.catalog.events.MetastoreEvents$InsertEvent.processPartitionInserts(MetastoreEvents.java:851)
> at 
> org.apache.impala.catalog.events.MetastoreEvents$InsertEvent.process(MetastoreEvents.java:835)
> at 
> org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.processIfEnabled(MetastoreEvents.java:346)
> at 
> org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:772)
> at 
> org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:670)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at 
> java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
> at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834)
> E0504 16:28:45.286345 189771 MetastoreEventsProcessor.java:795] Notification 
> event is null
> {noformat}
> It's due to the following try-clause:
> {code:java}
> try {
>   tryWriteLock(table, reason); // throws InternalException if timeout (2h) to 
> get write lock
>   ...
>   return numOfPartsReloaded;
> } catch (TableLoadingException e) { 
>   ...
> } catch (InternalException e) { 
>   throw new CatalogException(
>   "Could not acquire lock on the table " + table.getFullName(), e); 
> } finally {
>   UnlockWriteLockIfErronouslyLocked();
>   table.releaseWriteLock();
> }
> {code}
> https://github.com/apache/impala/blob/3608ab25f13708b1ba73b0f81abe37c1cda4e342/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java#L4604-L4641
> tryWriteLock() waits up to the timeout (2h) to get the table write lock and 
> throws an InternalException if it fails. The finally-clause always invokes 
> releaseWriteLock(), which then fails because the lock is not held by the 
> current thread.
> {code:java}
>   private void tryWriteLock(Table tbl, String operation) throws 
> InternalException {
> String type = tbl instanceof View ? "view" : "table";
> if (!catalog_.tryWriteLock(tbl)) {
>   throw new InternalException(String.format("Error %s (for) %s %s due to 
> " +
>   "lock contention.", operation, type, tbl.getFullName()));
> }
>   }{code}
> https://github.com/apache/impala/blob/3608ab25f13708b1ba73b0f81abe37c1cda4e342/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java#L7309-L7323
> We should check if the lock is held by the current thread before releasing it.
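
A minimal sketch of the guarded pattern (hypothetical names, not the actual 
Impala patch): only unlock in the finally-clause when the current thread 
actually holds the write lock:
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class GuardedUnlockSketch {
  private final ReentrantReadWriteLock tableLock = new ReentrantReadWriteLock();

  void reloadWithLock() throws Exception {
    try {
      // Mirrors tryWriteLock(): give up after a timeout instead of
      // blocking forever, and signal the failure with an exception.
      if (!tableLock.writeLock().tryLock(2, TimeUnit.HOURS)) {
        throw new Exception("Could not acquire lock due to lock contention");
      }
      // ... reload the partition under the lock ...
    } finally {
      // The guard is the fix: an unconditional unlock() here throws
      // IllegalMonitorStateException when tryLock() failed above.
      if (tableLock.isWriteLockedByCurrentThread()) {
        tableLock.writeLock().unlock();
      }
    }
  }
}
{code}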

[jira] [Commented] (IMPALA-12461) Avoid write lock on the table during self-event detection

2024-06-26 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17860280#comment-17860280
 ] 

Quanlong Huang commented on IMPALA-12461:
-

[~gsaihemanth] I think this is not resolved yet since partition-level events 
are not handled in commit 78b9285da457c6853e513f3852730867d4dbe632.

> Avoid write lock on the table during self-event detection
> -
>
> Key: IMPALA-12461
> URL: https://issues.apache.org/jira/browse/IMPALA-12461
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Csaba Ringhofer
>Assignee: Csaba Ringhofer
>Priority: Critical
> Fix For: Impala 4.5.0
>
>
> Saw some callstacks like this:
> {code}
>     at 
> org.apache.impala.catalog.CatalogServiceCatalog.tryLock(CatalogServiceCatalog.java:468)
>     at 
> org.apache.impala.catalog.CatalogServiceCatalog.tryWriteLock(CatalogServiceCatalog.java:436)
>     at 
> org.apache.impala.catalog.CatalogServiceCatalog.evaluateSelfEvent(CatalogServiceCatalog.java:1008)
>     at 
> org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.isSelfEvent(MetastoreEvents.java:609)
>     at 
> org.apache.impala.catalog.events.MetastoreEvents$BatchPartitionEvent.process(MetastoreEvents.java:1942)
> {code}
> At this point it was already checked that the event comes from Impala based 
> on the service id, and now we are checking the table's self-event list. Taking 
> the table lock can be problematic as other DDLs may hold the write lock at the 
> same time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Created] (IMPALA-13178) Flush the metadata cache to remote storage instead of just invalidating them in full GCs

2024-06-25 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13178:
---

 Summary: Flush the metadata cache to remote storage instead of 
just invalidating them in full GCs
 Key: IMPALA-13178
 URL: https://issues.apache.org/jira/browse/IMPALA-13178
 Project: IMPALA
  Issue Type: Improvement
  Components: Catalog
Reporter: Quanlong Huang
Assignee: Quanlong Huang


When invalidate_tables_on_memory_pressure is enabled, catalogd will invalidate 
10% of the tables (configured by invalidate_tables_fraction_on_memory_pressure) 
if the JVM old-gen usage still exceeds 60% (configured by 
invalidate_tables_gc_old_gen_full_threshold) after a full GC.

Later if the table is used again, catalogd will try to load its metadata. The 
loading process could also lead to OOM (see IMPALA-13117).

On the other hand, the metadata might not have changed, so it's wasteful to 
evict and reload it again. Fetching all the partitions from HMS and listing 
files on the storage are expensive. It'd be better to flush out the metadata 
cache of a table instead of just invalidating it. If there are no more 
invalidates (either implicit ones from HMS event processing or explicit ones 
from user commands) on the table, we can reuse the flushed metadata.

The metadata can be flushed to remote storage (e.g. HDFS/Ozone/S3), giving 
catalogd effectively unlimited space. We can consider flushing out just the 
encodedFileDescriptors (the file metadata) and the incremental stats, which are 
usually the majority of the metadata cache. Or we can use a well-defined format 
(e.g. Iceberg manifest files) so we can incrementally flush the metadata even 
with catalog changes (DDL/DMLs).
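
A rough sketch of the flush/reload flow (all names below are hypothetical; only 
the idea of gating reuse on "no invalidations since the flush" comes from the 
description above):
{code:java}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposal, not Impala code.
interface MetadataStore {                 // stands in for HDFS/Ozone/S3
  void write(String path, byte[] blob) throws IOException;
  byte[] read(String path) throws IOException;
}

class FlushingEvictionSketch {
  private final MetadataStore store;
  // Catalog version of each table at the time it was flushed.
  private final Map<String, Long> flushVersions = new HashMap<>();

  FlushingEvictionSketch(MetadataStore store) { this.store = store; }

  // On memory pressure: flush the compact metadata instead of dropping it.
  void evict(String table, byte[] encodedFileDescriptors, long version)
      throws IOException {
    store.write("/catalog-flush/" + table, encodedFileDescriptors);
    flushVersions.put(table, version);
  }

  // On next use: reuse the flushed blob only if nothing invalidated the
  // table (HMS event or user command) after the flush; otherwise fall
  // back to a full metadata load from HMS and the file system.
  byte[] tryReload(String table, long currentVersion) throws IOException {
    Long flushed = flushVersions.get(table);
    if (flushed != null && flushed == currentVersion) {
      return store.read("/catalog-flush/" + table);
    }
    return null;  // stale or never flushed: do a full load
  }
}
{code}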



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-13117) Improve the heap usage during metadata loading and DDL/DML executions

2024-06-25 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859823#comment-17859823
 ] 

Quanlong Huang commented on IMPALA-13117:
-

Ideally the overhead of metadata loading, i.e. the temp objects created during 
the load, should be negligible compared to the HdfsTable itself. However, a 
heap dump taken during metadata loading reveals that we hold the FileDescriptor 
objects until the parallel file metadata loading finishes.

!Selection_125.png|width=561,height=365!

Note that the table has a small-files issue, so the memory space is mostly 
occupied by file metadata. Each FileDescriptor object takes 256B. The 
encodedFileDescriptor (the byte array inside it) takes only 160B.

The FileDescriptors are unwrapped only after the loads on all partitions have 
finished:
[https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java#L161]
[https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L1585-L1586]

This introduces a 60% memory overhead during metadata loading compared to the 
space actually needed to cache the metadata. We should unwrap the 
FileDescriptors promptly, just after they are generated.
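
A sketch of the suggested change in shape (hypothetical types: String stands in 
for the FileDescriptor wrapper and byte[] for its encoded form; this is not the 
actual patch):
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class EagerEncodeSketch {
  // Stand-in for serializing a FileDescriptor (~256B) to its compact
  // encodedFileDescriptor byte[] form (~160B).
  static byte[] encode(String fd) {
    return fd.getBytes(StandardCharsets.UTF_8);
  }

  // Current shape: all wrapper objects stay alive until every partition
  // has finished loading; only then are they unwrapped in bulk.
  static List<byte[]> unwrapAfterAllLoads(List<List<String>> partitions) {
    List<String> wrappers = new ArrayList<>();
    for (List<String> p : partitions) wrappers.addAll(p);  // fat objects pile up
    List<byte[]> encoded = new ArrayList<>(wrappers.size());
    for (String fd : wrappers) encoded.add(encode(fd));
    return encoded;
  }

  // Suggested shape: unwrap each descriptor right after it is generated,
  // so only the compact byte[] form accumulates during the load.
  static List<byte[]> unwrapEagerly(List<List<String>> partitions) {
    List<byte[]> encoded = new ArrayList<>();
    for (List<String> p : partitions) {
      for (String fd : p) encoded.add(encode(fd));  // no wrapper retained
    }
    return encoded;
  }
}
{code}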

> Improve the heap usage during metadata loading and DDL/DML executions
> -
>
> Key: IMPALA-13117
> URL: https://issues.apache.org/jira/browse/IMPALA-13117
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
>  Labels: catalog-2024
> Attachments: Selection_125.png
>
>
> The JVM heap size of catalogd is not just used by the metadata cache. The 
> in-progress metadata loading threads and DDL/DML executions also create temp 
> objects, which introduce spikes in the heap usage. We should improve the 
> heap usage in this part, especially when the metadata loading is slow due to 
> external slowness (e.g. listing files on S3).
> CC [~mylogi...@gmail.com] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-13117) Improve the heap usage during metadata loading and DDL/DML executions

2024-06-25 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13117:

Attachment: Selection_125.png

> Improve the heap usage during metadata loading and DDL/DML executions
> -
>
> Key: IMPALA-13117
> URL: https://issues.apache.org/jira/browse/IMPALA-13117
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
>  Labels: catalog-2024
> Attachments: Selection_125.png
>
>
> The JVM heap size of catalogd is not just used by the metadata cache. The 
> in-progress metadata loading threads and DDL/DML executions also create temp 
> objects, which introduce spikes in the heap usage. We should improve the 
> heap usage in this part, especially when the metadata loading is slow due to 
> external slowness (e.g. listing files on S3).
> CC [~mylogi...@gmail.com] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-13177) Compress encodedFileDescriptors inside the same partition

2024-06-25 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13177:

Description: 
File names under a table usually share some substrings, e.g. query id, job id, 
task id, etc. We can compress them to save some memory space. Especially in the 
case of small files issue, the memory footprint of the metadata cache is 
occupied by encodedFileDescriptors.

An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each 
encodedFileDescriptor is a byte array that takes 160B. Code:
[https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723]

Files of that table are created by Spark jobs. Here are some file names inside 
the same partition:
{noformat}
part-0-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-1-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-2-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-3-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-4-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-5-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-6-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-7-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-8-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-9-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00010-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00011-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00012-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00013-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00014-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00015-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
 {noformat}
By compressing the encodedFileDescriptors inside the same partition, we should 
be able to save significant memory in this case. Compressing all of them within 
the same table might be even better, but it would hurt performance when the 
coordinator loads specific partitions from catalogd.

We can consider doing this only for partitions whose number of files exceeds a 
threshold (e.g. 10).
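
A minimal sketch of the idea using java.util.zip (illustrative only; a real 
implementation would also need cheap per-file access into the compressed 
block):
{code:java}
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class FdCompressionSketch {
  // Deflate all file names of one partition as a single block: the long
  // shared "part-NNNNN-<uuid>-<uuid>.c000" substrings compress very well.
  public static byte[] compress(String[] fileNames) {
    StringBuilder sb = new StringBuilder();
    for (String name : fileNames) sb.append(name).append('\n');
    byte[] input = sb.toString().getBytes(StandardCharsets.UTF_8);

    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    deflater.setInput(input);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    return out.toByteArray();
  }
}
{code}
Since the example names above differ only in the part index, one compressed 
block per partition should shrink the file-name portion of the descriptors to a 
small fraction of its current cost.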

  was:
File names under a table usually share some substrings, e.g. query id, job id, 
task id, etc. We can compress them to save some memory space. Especially in the 
case of small files issue, the memory footprint of the metadata cache is 
occupied by encodedFileDescriptors.

An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each 
encodedFileDescriptor is a byte array that takes 160B. Codes:
[https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723]

Files of that table are created by Spark jobs. An example file name: 
part-6-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
Here are some file names inside the same partition:
!Selection_124.png|width=410,height=172!

By compressing the encodedFileDescriptors inside the same partition, we should 
be able to save a significant memory space in this case. Compressing all of 
them inside the same table might be even better, but it impacts the performance 
when coordinator loading specific partitions from catalogd.

We can consider only do this for partitions whose number of files exceeds a 
threshold (e.g. 10).


> Compress encodedFileDescriptors inside the same partition
> -
>
> Key: IMPALA-13177
> URL: https://issues.apache.org/jira/browse/IMPALA-13177
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
>  Labels: catalog-2024
> Attachments: Selection_124.png
>
>
> File names under a table usually share some substrings, e.g. query id, job 
> id, task id, etc. We can compress them to save some memory space. Especially 
> in the case of small files issue, the memory footprint of the metadata cache 
> is occupied by encodedFileDescriptors.
> An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
> on S3 

[jira] [Updated] (IMPALA-13177) Compress encodedFileDescriptors inside the same partition

2024-06-24 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13177:

Description: 
File names under a table usually share some substrings, e.g. query id, job id, 
task id, etc. We can compress them to save some memory space. Especially in the 
case of small files issue, the memory footprint of the metadata cache is 
occupied by encodedFileDescriptors.

An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each 
encodedFileDescriptor is a byte array that takes 160B. Codes:
[https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723]

Files of that table are created by Spark jobs. An example file name: 
part-6-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
Here are some file names inside the same partition:
!Selection_124.png|width=410,height=172!

By compressing the encodedFileDescriptors inside the same partition, we should 
be able to save a significant memory space in this case. Compressing all of 
them inside the same table might be even better, but it impacts the performance 
when coordinator loading specific partitions from catalogd.

We can consider only do this for partitions whose number of files exceeds a 
threshold (e.g. 10).

  was:
File names under a table usually share some substrings, e.g. query id, job id, 
task id, etc. We can compress them to save some memory space. Especially in the 
case of small files issue, the memory footprint of the metadata cache is 
occupied by encodedFileDescriptors.

An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each 
encodedFileDescriptor is a byte array that takes 160B. Codes:
[https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723]

Files of that table are created by Spark jobs. An example file name: 
part-6-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
Here are some file names inside the same partition:
!Selection_124.png|width=410,height=172!

By compressing the encodedFileDescriptors inside the same partition, we should 
be able to save a significant memory space in this case. Compressing all of 
them inside the same table might be even better, but it impacts the performance 
when coordinator loading specific partitions from catalogd.


> Compress encodedFileDescriptors inside the same partition
> -
>
> Key: IMPALA-13177
> URL: https://issues.apache.org/jira/browse/IMPALA-13177
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
>  Labels: catalog-2024
> Attachments: Selection_124.png
>
>
> File names under a table usually share some substrings, e.g. query id, job 
> id, task id, etc. We can compress them to save some memory space. Especially 
> in the case of small files issue, the memory footprint of the metadata cache 
> is occupied by encodedFileDescriptors.
> An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
> on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each 
> encodedFileDescriptor is a byte array that takes 160B. Codes:
> [https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723]
> Files of that table are created by Spark jobs. An example file name: 
> part-6-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
> Here are some file names inside the same partition:
> !Selection_124.png|width=410,height=172!
> By compressing the encodedFileDescriptors inside the same partition, we 
> should be able to save a significant memory space in this case. Compressing 
> all of them inside the same table might be even better, but it impacts the 
> performance when coordinator loading specific partitions from catalogd.
> We can consider only do this for partitions whose number of files exceeds a 
> threshold (e.g. 10).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-13177) Compress encodedFileDescriptors inside the same partition

2024-06-24 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13177:

Labels: catalog-2024  (was: )

> Compress encodedFileDescriptors inside the same partition
> -
>
> Key: IMPALA-13177
> URL: https://issues.apache.org/jira/browse/IMPALA-13177
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
>  Labels: catalog-2024
> Attachments: Selection_124.png
>
>
> File names under a table usually share some substrings, e.g. query id, job 
> id, task id, etc. We can compress them to save some memory space. Especially 
> in the case of small files issue, the memory footprint of the metadata cache 
> is occupied by encodedFileDescriptors.
> An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
> on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each 
> encodedFileDescriptor is a byte array that takes 160B. Codes:
> [https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723]
> Files of that table are created by Spark jobs. An example file name: 
> part-6-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
> Here are some file names inside the same partition:
> !Selection_124.png|width=410,height=172!
> By compressing the encodedFileDescriptors inside the same partition, we 
> should be able to save a significant memory space in this case. Compressing 
> all of them inside the same table might be even better, but it impacts the 
> performance when coordinator loading specific partitions from catalogd.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Created] (IMPALA-13177) Compress encodedFileDescriptors inside the same partition

2024-06-24 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13177:
---

 Summary: Compress encodedFileDescriptors inside the same partition
 Key: IMPALA-13177
 URL: https://issues.apache.org/jira/browse/IMPALA-13177
 Project: IMPALA
  Issue Type: Improvement
  Components: Catalog
Reporter: Quanlong Huang
Assignee: Quanlong Huang
 Attachments: Selection_124.png

File names under a table usually share some substrings, e.g. query id, job id, 
task id, etc. We can compress them to save some memory space. Especially in the 
case of small files issue, the memory footprint of the metadata cache is 
occupied by encodedFileDescriptors.

An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each 
encodedFileDescriptor is a byte array that takes 160B. Codes:
https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723

Files of that table are created by Spark jobs. An example file name: 
part-6-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
Here are some file names inside the same partition:

By compressing the encodedFileDescriptors inside the same partition, we should 
be able to save a significant memory space in this case. Compressing all of 
them inside the same table might be even better, but it impacts the performance 
when coordinator loading specific partitions from catalogd.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)





[jira] [Updated] (IMPALA-13177) Compress encodedFileDescriptors inside the same partition

2024-06-24 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13177:

Description: 
File names under a table usually share some substrings, e.g. query id, job id, 
task id, etc. We can compress them to save some memory space. Especially in the 
case of small files issue, the memory footprint of the metadata cache is 
occupied by encodedFileDescriptors.

An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each 
encodedFileDescriptor is a byte array that takes 160B. Codes:
[https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723]

Files of that table are created by Spark jobs. An example file name: 
part-6-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
Here are some file names inside the same partition:
!Selection_124.png|width=410,height=172!

By compressing the encodedFileDescriptors inside the same partition, we should 
be able to save a significant memory space in this case. Compressing all of 
them inside the same table might be even better, but it impacts the performance 
when coordinator loading specific partitions from catalogd.

  was:
File names under a table usually share some substrings, e.g. query id, job id, 
task id, etc. We can compress them to save some memory space. Especially in the 
case of small files issue, the memory footprint of the metadata cache is 
occupied by encodedFileDescriptors.

An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each 
encodedFileDescriptor is a byte array that takes 160B. Codes:
https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723

Files of that table are created by Spark jobs. An example file name: 
part-6-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
Here are some file names inside the same partition:
 !Selection_124.png! 

By compressing the encodedFileDescriptors inside the same partition, we should 
be able to save a significant memory space in this case. Compressing all of 
them inside the same table might be even better, but it impacts the performance 
when coordinator loading specific partitions from catalogd.


> Compress encodedFileDescriptors inside the same partition
> -
>
> Key: IMPALA-13177
> URL: https://issues.apache.org/jira/browse/IMPALA-13177
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
> Attachments: Selection_124.png
>
>
> File names under a table usually share some substrings, e.g. query id, job 
> id, task id, etc. We can compress them to save some memory space. Especially 
> in the case of small files issue, the memory footprint of the metadata cache 
> is occupied by encodedFileDescriptors.
> An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
> on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each 
> encodedFileDescriptor is a byte array that takes 160B. Codes:
> [https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723]
> Files of that table are created by Spark jobs. An example file name: 
> part-6-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
> Here are some file names inside the same partition:
> !Selection_124.png|width=410,height=172!
> By compressing the encodedFileDescriptors inside the same partition, we 
> should be able to save a significant memory space in this case. Compressing 
> all of them inside the same table might be even better, but it impacts the 
> performance when coordinator loading specific partitions from catalogd.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-13177) Compress encodedFileDescriptors inside the same partition

2024-06-24 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13177:

Description: 
File names under a table usually share some substrings, e.g. query id, job id, 
task id, etc. We can compress them to save some memory space. Especially in the 
case of small files issue, the memory footprint of the metadata cache is 
occupied by encodedFileDescriptors.

An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each 
encodedFileDescriptor is a byte array that takes 160B. Codes:
https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723

Files of that table are created by Spark jobs. An example file name: 
part-6-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
Here are some file names inside the same partition:
 !Selection_124.png! 

By compressing the encodedFileDescriptors inside the same partition, we should 
be able to save a significant memory space in this case. Compressing all of 
them inside the same table might be even better, but it impacts the performance 
when coordinator loading specific partitions from catalogd.

  was:
File names under a table usually share some substrings, e.g. query id, job id, 
task id, etc. We can compress them to save some memory space. Especially in the 
case of small files issue, the memory footprint of the metadata cache is 
occupied by encodedFileDescriptors.

An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each 
encodedFileDescriptor is a byte array that takes 160B. Codes:
https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723

Files of that table are created by Spark jobs. An example file name: 
part-6-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
Here are some file names inside the same partition:

By compressing the encodedFileDescriptors inside the same partition, we should 
be able to save a significant memory space in this case. Compressing all of 
them inside the same table might be even better, but it impacts the performance 
when coordinator loading specific partitions from catalogd.


> Compress encodedFileDescriptors inside the same partition
> -
>
> Key: IMPALA-13177
> URL: https://issues.apache.org/jira/browse/IMPALA-13177
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
> Attachments: Selection_124.png
>
>
> File names under a table usually share some substrings, e.g. query id, job 
> id, task id, etc. We can compress them to save some memory space. Especially 
> in the case of small files issue, the memory footprint of the metadata cache 
> is occupied by encodedFileDescriptors.
> An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
> on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each 
> encodedFileDescriptor is a byte array that takes 160B. Codes:
> https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723
> Files of that table are created by Spark jobs. An example file name: 
> part-6-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
> Here are some file names inside the same partition:
>  !Selection_124.png! 
> By compressing the encodedFileDescriptors inside the same partition, we 
> should be able to save a significant memory space in this case. Compressing 
> all of them inside the same table might be even better, but it impacts the 
> performance when coordinator loading specific partitions from catalogd.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-13170) InconsistentMetadataFetchException due to database dropped when showing databases

2024-06-24 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17859619#comment-17859619
 ] 

Quanlong Huang commented on IMPALA-13170:
-

Uploaded a patch for review: https://gerrit.cloudera.org/#/c/21546/

> InconsistentMetadataFetchException due to database dropped when showing 
> databases
> -
>
> Key: IMPALA-13170
> URL: https://issues.apache.org/jira/browse/IMPALA-13170
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Affects Versions: Impala 3.4.0
>Reporter: Yida Wu
>Assignee: Quanlong Huang
>Priority: Major
>
> Using impalad 3.4.0, an InconsistentMetadataFetchException occurs when 
> running "show databases" in Impala while simultaneously executing "drop 
> database" to drop the newly created database in Hive.
> Steps are:
> 1. Create a database (Hive)
> 2. Create tables (Hive)
> 3. Drop the tables (Hive)
> 4. Run show databases (Impala) while dropping the database (Hive)
> Logs in Impalad:
> {code:java}
> I0610 02:18:32.435815 278475 CatalogdMetaProvider.java:1354] 1:2] 
> Invalidated objects in cache: [list of database names, HMS_METADATA for DB 
> test_hive]
> I0610 02:18:32.436224 278475 jni-util.cc:288] 1:2] 
> org.apache.impala.catalog.local.InconsistentMetadataFetchException: Fetching 
> DATABASE failed. Could not find TCatalogObject(type:DATABASE, 
> catalog_version:0, db:TDatabase(db_name:test_hive))   
>   
>   
> 
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.sendRequest(CatalogdMetaProvider.java:424)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.access$100(CatalogdMetaProvider.java:185)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider$2.call(CatalogdMetaProvider.java:643)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider$2.call(CatalogdMetaProvider.java:638)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.loadWithCaching(CatalogdMetaProvider.java:521)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.loadDb(CatalogdMetaProvider.java:635)
>   at org.apache.impala.catalog.local.LocalDb.getMetaStoreDb(LocalDb.java:91) 
>   at org.apache.impala.catalog.local.LocalDb.getOwnerUser(LocalDb.java:294)
>   at org.apache.impala.service.Frontend.getDbs(Frontend.java:1066)
>   at org.apache.impala.service.JniFrontend.getDbs(JniFrontend.java:301)
> I0610 02:18:32.436257 278475 status.cc:129] 1:2] 
> InconsistentMetadataFetchException: Fetching DATABASE failed. Could not find 
> TCatalogObject(type:DATABASE, catalog_version:0, 
> {code}
> Logs in Catalog:
> {code:java}
> I0610 02:18:16.190133 222885 MetastoreEvents.java:505] EventId: 141467532 
> EventType: CREATE_DATABASE Successfully added database test_hive 
> ...
> I0610 02:18:32.276082 222885 MetastoreEvents.java:516] EventId: 141467562 
> EventType: DROP_DATABASE Creating event 141467562 of type DROP_DATABASE on 
> database test_hive
> I0610 02:18:32.277876 222885 MetastoreEvents.java:254] Total number of events 
> received: 6 Total number of events filtered out: 0
> I0610 02:18:32.277910 222885 MetastoreEvents.java:258] Incremented skipped 
> metric to 2564
> I0610 02:18:32.279537 222885 MetastoreEvents.java:505] EventId: 141467562 
> EventType: DROP_DATABASE Removed Database test_hive
> {code}
> The case is similar to IMPALA-9441. We may want to handle the error in a 
> better way in Frontend.getDbs().
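
One possible shape for that handling (a hedged sketch with stand-in types, not 
the actual patch to Frontend.getDbs()): tolerate a database vanishing between 
listing it and fetching its metadata:
{code:java}
import java.util.ArrayList;
import java.util.List;

class GetDbsSketch {
  // Stand-ins for LocalDb and InconsistentMetadataFetchException.
  static class MetadataGoneException extends RuntimeException {}
  interface Db {
    String getOwnerUser();  // may throw if the DB was dropped concurrently
  }

  // Skip databases dropped between "list databases" and the per-DB
  // metadata fetch instead of failing the whole "show databases".
  static List<Db> getDbs(List<Db> candidates) {
    List<Db> result = new ArrayList<>();
    for (Db db : candidates) {
      try {
        db.getOwnerUser();  // the fetch that raced with DROP DATABASE
        result.add(db);
      } catch (MetadataGoneException e) {
        // Concurrently dropped: omit it from the result.
      }
    }
    return result;
  }
}
{code}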



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Resolved] (IMPALA-12979) Wildcard in CLASSPATH might not work in the RPM package

2024-06-21 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang resolved IMPALA-12979.
-
Resolution: Fixed

> Wildcard in CLASSPATH might not work in the RPM package
> ---
>
> Key: IMPALA-12979
> URL: https://issues.apache.org/jira/browse/IMPALA-12979
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 3.4.2
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
> Fix For: Impala 3.4.2
>
>
> I tried deploying the RPM package of Impala-3.4.2 (commit 8e9c5a5) on CentOS 
> 7.9 and found that launching catalogd failed with the following error (in 
> catalogd.INFO):
> {noformat}
> Wrote minidump to 
> /var/log/impala-minidumps/catalogd/5e3c8819-0593-4943-555addbc-665470ad.dmp
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x02baf14c, pid=156082, tid=0x7fec0dce59c0
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_141-b15) (build 
> 1.8.0_141-b15)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.141-b15 mixed mode 
> linux-amd64 compressed oops)
> # Problematic frame:
> # C  [catalogd+0x27af14c]  
> llvm::SCEVAddRecExpr::getNumIterationsInRange(llvm::ConstantRange const&, 
> llvm::ScalarEvolution&) const+0x73c
> #
> # Core dump written. Default location: /opt/impala/core or core.156082
> #
> # An error report file with more information is saved as:
> # /tmp/hs_err_pid156082.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> # The crash happened outside the Java Virtual Machine in native code.
> # See problematic frame for where to report the bug.
> # {noformat}
> There are other logs in catalogd.ERROR:
> {noformat}
> Log file created at: 2024/04/08 04:49:28
> Running on machine: ccycloud-1.quanlong.root.comops.site
> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
> E0408 04:49:28.979386 158187 logging.cc:146] stderr will be logged to this 
> file.
> Wrote minidump to 
> /var/log/impala-minidumps/catalogd/6c3f550c-be96-4a5b-61171aac-0de15155.dmp
> could not find method getRootCauseMessage from class (null) with signature 
> (Ljava/lang/Throwable;)Ljava/lang/String;
> could not find method getStackTrace from class (null) with signature 
> (Ljava/lang/Throwable;)Ljava/lang/String;
> FileSystem: loadFileSystems failed error:
> (unable to get root cause for java.lang.NoClassDefFoundError)
> (unable to get stack trace for java.lang.NoClassDefFoundError){noformat}
> Resolving the minidump shows me the following stacktrace:
> {noformat}
> (gdb) bt
> #0  0x02baf14c in ?? ()
> #1  0x02baee24 in getJNIEnv ()
> #2  0x02bacb71 in hdfsBuilderConnect ()
> #3  0x012e6ae2 in impala::JniUtil::InitLibhdfs() ()
> #4  0x012e7897 in impala::JniUtil::Init() ()
> #5  0x00be9297 in impala::InitCommonRuntime(int, char**, bool, 
> impala::TestInfo::Mode) ()
> #6  0x00bb604a in CatalogdMain(int, char**) ()
> #7  0x00b33f97 in main (){noformat}
> It indicates that something went wrong initializing the JVM. Here are the env vars:
> {noformat}
> Environment Variables:
> JAVA_HOME=/usr/java/jdk1.8.0_141
> CLASSPATH=/opt/impala/conf:/opt/impala/jar/*
> PATH=/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/root/bin
> LD_LIBRARY_PATH=/opt/impala/lib/:/usr/java/jdk1.8.0_141/jre/lib/amd64/server:/usr/java/jdk1.8.0_141/jre/lib/amd64
> SHELL=/bin/bash{noformat}
> We use a wildcard "*" in the classpath, which seems to be the cause. The issue 
> was resolved after using explicit paths in the classpath. Here is what I 
> changed in bin/impala-env.sh:
> {code:bash}
> #export CLASSPATH="/opt/impala/conf:/opt/impala/jar/*"
> CLASSPATH=/opt/impala/conf
> for jar in /opt/impala/jar/*.jar; do
>   CLASSPATH="$CLASSPATH:$jar"
> done
> export CLASSPATH
> {code}
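>
> A likely explanation: CLASSPATH wildcard ("*") expansion is performed by the
> `java` launcher, not by the JVM itself, so a JVM embedded through the JNI
> invocation API (as catalogd does via libhdfs' getJNIEnv(), per the stacktrace
> above) sees the literal '/opt/impala/jar/*' entry and resolves no classes from
> it. A hypothetical snippet (not part of Impala) to confirm what classpath the
> JVM actually received:
> {code:java}
> // Prints the classpath as the JVM sees it. Under an embedded JVM started
> // with CLASSPATH=/opt/impala/conf:/opt/impala/jar/*, the '*' entry would
> // appear literally instead of being expanded to the jar files.
> public class PrintClasspath {
>   public static void main(String[] args) {
>     System.out.println(System.getProperty("java.class.path"));
>   }
> }
> {code}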



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-13170) InconsistentMetadataFetchException due to database dropped when showing databases

2024-06-20 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856633#comment-17856633
 ] 

Quanlong Huang commented on IMPALA-13170:
-

[~baggio000] The exception from JniFrontend.getCatalogMetrics() should be 
resolved by IMPALA-8675 (see IMPALA-11409). After IMPALA-8675, local-catalog 
mode coordinators no longer update the db and table counts in 
getCatalogMetrics(), so they avoid hitting this.

The current issue happens when SHOW DATABASES wants to check visibility for the 
current user. To be specific, Frontend.getDbs() only invokes 
db.getOwnerUser() when 'needsAuthChecks' is true:
{code:java}
  public List<? extends FeDb> getDbs(PatternMatcher matcher, User user)
      throws InternalException {
    List<? extends FeDb> dbs = getCatalog().getDbs(matcher);

    boolean needsAuthChecks = authzFactory_.getAuthorizationConfig().isEnabled()
      && !userHasAccessForWholeServer(user);

    // Filter out the databases the user does not have permissions on.
    if (needsAuthChecks) {
      Iterator<? extends FeDb> iter = dbs.iterator();
      List<Future<Boolean>> pendingCheckTasks = Lists.newArrayList();
      while (iter.hasNext()) {
        FeDb db = iter.next();
        pendingCheckTasks.add(checkAuthorizationPool_.submit(
            new CheckAuthorization(db.getName(), null, db.getOwnerUser(),
            user))); <-- Calls db.getOwnerUser() here
      }

      filterUnaccessibleElements(pendingCheckTasks, dbs);
    }

    return dbs;
  }{code}
[https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/service/Frontend.java#L1429]

In local-catalog mode, db.getOwnerUser() could trigger a new catalog RPC to 
fetch the metadata of the db. If a db exists when the coordinator calls 
getCatalog().getDbs(matcher) and is then dropped in catalogd before the 
coordinator calls db.getOwnerUser(), the error occurs.

A workaround is to retry the SHOW DATABASES command.
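
For clients that hit this race often, a small retry loop suffices. A minimal
sketch, assuming a JDBC connection to Impala and that the error text contains
the exception name as in the log above (the helper itself is hypothetical):
{code:java}
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class ShowDatabasesRetry {
  // The InconsistentMetadataFetchException is transient: it only occurs when
  // a db is dropped between the two metadata fetches described above, so a
  // bounded retry is enough.
  static List<String> showDatabasesWithRetry(Connection conn, int maxAttempts)
      throws SQLException {
    SQLException last = null;
    for (int i = 0; i < maxAttempts; ++i) {
      try (Statement stmt = conn.createStatement();
           ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
        List<String> dbs = new ArrayList<>();
        while (rs.next()) dbs.add(rs.getString(1));
        return dbs;
      } catch (SQLException e) {
        String msg = e.getMessage();
        if (msg == null || !msg.contains("InconsistentMetadataFetchException")) {
          throw e;  // Not the transient error; don't retry.
        }
        last = e;
      }
    }
    if (last != null) throw last;
    throw new SQLException("maxAttempts must be positive");
  }
}
{code}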

> InconsistentMetadataFetchException due to database dropped when showing 
> databases
> -
>
> Key: IMPALA-13170
> URL: https://issues.apache.org/jira/browse/IMPALA-13170
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Affects Versions: Impala 3.4.0
>Reporter: Yida Wu
>Assignee: Quanlong Huang
>Priority: Major
>
> Using impalad 3.4.0, an InconsistentMetadataFetchException occurs when 
> running "show databases" in Impala while simultaneously executing "drop 
> database" to drop the newly created database in Hive.
> Step is:
> 1, Creates database (Hive)
> 2, Creates tables (Hive)
> 3, Drops tables (Hive)
> 4, Run show databases (Impala)  Drop database (Hive)
> Logs in Impalad:
> {code:java}
> I0610 02:18:32.435815 278475 CatalogdMetaProvider.java:1354] 1:2] 
> Invalidated objects in cache: [list of database names, HMS_METADATA for DB 
> test_hive]
> I0610 02:18:32.436224 278475 jni-util.cc:288] 1:2] 
> org.apache.impala.catalog.local.InconsistentMetadataFetchException: Fetching 
> DATABASE failed. Could not find TCatalogObject(type:DATABASE, 
> catalog_version:0, db:TDatabase(db_name:test_hive))   
>   
>   
> 
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.sendRequest(CatalogdMetaProvider.java:424)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.access$100(CatalogdMetaProvider.java:185)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider$2.call(CatalogdMetaProvider.java:643)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider$2.call(CatalogdMetaProvider.java:638)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.loadWithCaching(CatalogdMetaProvider.java:521)
>   at 
> org.apache.impala.catalog.local.CatalogdMetaProvider.loadDb(CatalogdMetaProvider.java:635)
>   at org.apache.impala.catalog.local.LocalDb.getMetaStoreDb(LocalDb.java:91) 
>   at org.apache.impala.catalog.local.LocalDb.getOwnerUser(LocalDb.java:294)
>   at org.apache.impala.service.Frontend.getDbs(Frontend.java:1066)
>   at org.apache.impala.service.JniFrontend.getDbs(JniFrontend.java:301)
> I0610 02:18:32.436257 278475 status.cc:129] 1:2] 
> InconsistentMetadataFetchException: Fetching DATABASE failed. Could not find 
> TCatalogObject(type:DATABASE, catalog_version:0, 
> {code}
> Logs in Catalog:
> {code:java}
> I0610 02:18:16.190133 222885 MetastoreEvents.java:505] EventId: 141467532 
> EventType: CREATE_DATABASE Successfully added database test_hive 
> ...
> I0610 02:18:32.276082 222885 MetastoreEvents.java:516] EventId: 141467562 
> EventType: 

[jira] [Updated] (IMPALA-12051) Propagate analytic tuple predicates of outer-joined InlineView

2024-06-19 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-12051:

Target Version: Impala 4.1.3

> Propagate analytic tuple predicates of outer-joined InlineView
> --
>
> Key: IMPALA-12051
> URL: https://issues.apache.org/jira/browse/IMPALA-12051
> Project: IMPALA
>  Issue Type: Bug
>Reporter: ZhuMinghui
>Assignee: ZhuMinghui
>Priority: Major
> Fix For: Impala 4.3.0
>
> Attachments: image-2023-04-07-11-57-13-571.png, 
> image-2023-04-07-11-57-59-883.png
>
>
> In some cases, directly pushing down predicates that reference the analytic 
> tuple into an inline view leads to incorrect query results, such as this SQL:
> {code:java}
> WITH detail_measure AS (
>   SELECT
>     *
>   FROM
>     (
>       VALUES
>         (
>           1 AS `isqbiuar`,
>           1 AS `bgsfrbun`,
>           1 AS `result_type`,
>           1 AS `bjuzzevg`
>         ),
>         (2, 2, 2, 2)
>     ) a
> ),
> order_measure_sql0 AS (
>   SELECT
>     row_number() OVER (
>       ORDER BY
>         row_number_0 DESC NULLS LAST,
>         isqbiuar ASC NULLS LAST
>     ) AS `row_number_0`,
>     `isqbiuar`
>   FROM
>     (
>       VALUES
>         (1 AS `row_number_0`, 1 AS `isqbiuar`),
>         (2, 2)
>     ) b
> )
> SELECT
>   detail_measure.`isqbiuar` AS `isqbiuar`,
>   detail_measure.`bgsfrbun` AS `bgsfrbun`,
>   detail_measure.`result_type` AS `result_type`,
>   detail_measure.`bjuzzevg` AS `bjuzzevg`,
>   `row_number_0` AS `row_number_0`
> FROM
>   detail_measure
>   LEFT JOIN order_measure_sql0 ON order_measure_sql0.isqbiuar = 
> detail_measure.isqbiuar
> WHERE
>   row_number_0 BETWEEN 1
>   AND 1
> ORDER BY
>   `row_number_0` ASC NULLS LAST,
>   `bgsfrbun` ASC NULLS LAST{code}
> The current query result is:
> !image-2023-04-07-11-57-13-571.png!
> The correct query result is:
> !image-2023-04-07-11-57-59-883.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-12051) Propagate analytic tuple predicates of outer-joined InlineView

2024-06-19 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-12051:

Fix Version/s: Impala 4.3.0

> Propagate analytic tuple predicates of outer-joined InlineView
> --
>
> Key: IMPALA-12051
> URL: https://issues.apache.org/jira/browse/IMPALA-12051
> Project: IMPALA
>  Issue Type: Bug
>Reporter: ZhuMinghui
>Assignee: ZhuMinghui
>Priority: Major
> Fix For: Impala 4.3.0
>
> Attachments: image-2023-04-07-11-57-13-571.png, 
> image-2023-04-07-11-57-59-883.png
>
>
> In some cases, directly pushing down predicates that reference the analytic 
> tuple into an inline view leads to incorrect query results, such as this SQL:
> {code:java}
> WITH detail_measure AS (
>   SELECT
>     *
>   FROM
>     (
>       VALUES
>         (
>           1 AS `isqbiuar`,
>           1 AS `bgsfrbun`,
>           1 AS `result_type`,
>           1 AS `bjuzzevg`
>         ),
>         (2, 2, 2, 2)
>     ) a
> ),
> order_measure_sql0 AS (
>   SELECT
>     row_number() OVER (
>       ORDER BY
>         row_number_0 DESC NULLS LAST,
>         isqbiuar ASC NULLS LAST
>     ) AS `row_number_0`,
>     `isqbiuar`
>   FROM
>     (
>       VALUES
>         (1 AS `row_number_0`, 1 AS `isqbiuar`),
>         (2, 2)
>     ) b
> )
> SELECT
>   detail_measure.`isqbiuar` AS `isqbiuar`,
>   detail_measure.`bgsfrbun` AS `bgsfrbun`,
>   detail_measure.`result_type` AS `result_type`,
>   detail_measure.`bjuzzevg` AS `bjuzzevg`,
>   `row_number_0` AS `row_number_0`
> FROM
>   detail_measure
>   LEFT JOIN order_measure_sql0 ON order_measure_sql0.isqbiuar = 
> detail_measure.isqbiuar
> WHERE
>   row_number_0 BETWEEN 1
>   AND 1
> ORDER BY
>   `row_number_0` ASC NULLS LAST,
>   `bgsfrbun` ASC NULLS LAST{code}
> The current query result is:
> !image-2023-04-07-11-57-13-571.png!
> The correct query result is:
> !image-2023-04-07-11-57-59-883.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-13150) Possible buffer overflow in StringVal::CopyFrom()

2024-06-18 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13150:

Fix Version/s: Impala 4.5.0

> Possible buffer overflow in StringVal::CopyFrom()
> -
>
> Key: IMPALA-13150
> URL: https://issues.apache.org/jira/browse/IMPALA-13150
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Daniel Becker
>Assignee: Daniel Becker
>Priority: Major
> Fix For: Impala 4.5.0
>
>
> In {{{}StringVal::CopyFrom(){}}}, we take the 'len' parameter as a 
> {{{}size_t{}}}, which is usually a 64-bit unsigned integer. We pass it to the 
> constructor of {{{}StringVal{}}}, which takes it as an {{{}int{}}}, which is 
> usually a 32-bit signed integer. The constructor then allocates memory for 
> the length using the {{int}} value, but back in {{{}CopyFrom(){}}}, we copy 
> the buffer with the {{size_t}} length. If {{size_t}} is indeed 64 bits and 
> {{int}} is 32 bits, and the value is truncated, we may copy more bytes than 
> what we have allocated the destination for. See 
> https://github.com/apache/impala/blob/ce8078204e5995277f79e226e26fe8b9eaca408b/be/src/udf/udf.cc#L546
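>
> The hazard, illustrated with Java's 64-to-32-bit narrowing as an analogy to
> the size_t-to-int truncation (illustrative only, not Impala code):
> {code:java}
> public class TruncationDemo {
>   public static void main(String[] args) {
>     long len = (1L << 32) + 16;  // caller's 64-bit length: 4 GiB + 16 bytes
>     int allocLen = (int) len;    // narrows to 16; the allocation is tiny
>     System.out.println(len + " -> " + allocLen);  // 4294967312 -> 16
>     // Copying 'len' bytes into a buffer sized for 'allocLen' bytes would
>     // overflow the destination by roughly 4 GiB.
>   }
> }
> {code}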



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-13160) Impala query stuck after query from special partition 'hour=0' and 'hour=00' which hour type is int

2024-06-17 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855779#comment-17855779
 ] 

Quanlong Huang commented on IMPALA-13160:
-

CC [~mylogi...@gmail.com] [~VenuReddy] [~hemanth619] [~ngangam] for more 
thoughts.

> Impala query stuck after query from special partition 'hour=0' and 'hour=00' 
> which hour type is int
> ---
>
> Key: IMPALA-13160
> URL: https://issues.apache.org/jira/browse/IMPALA-13160
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog, fe
>Affects Versions: Impala 3.4.0, Impala 4.3.0
>Reporter: LiuYuan
>Priority: Critical
>
> 1. When creating a table as below:
> {code:java}
>  CREATE TABLE hive_partition.two_partition (               
>    id INT,                                                 
>    name STRING                                             
>  )                                                         
>  PARTITIONED BY (                                          
>    day INT,                                                
>    hour INT                                                
>  )                                                         
>  WITH SERDEPROPERTIES ('serialization.format'='1')         
>  STORED AS ORC                                             
>  LOCATION 'hdfs://ly-pfs/hive/hive_partition/two_partition'{code}
> 2. Then create dirs as below:
>  
> {code:java}
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=0
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=00
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=01
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=02
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=03
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=04
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=05
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=06
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=07
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=08
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=09
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=1
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=10
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=11
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=12
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=13
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=14
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=15
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=16
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=17
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=18
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=19
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=2
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=20
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=21
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=22
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=23
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=3
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=4
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=5
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=6
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=7
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=8
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=9{code}
>  
> 3. Execute REFRESH hive_partition.two_partition multiple times.
> On Impala 3.4.0, the total partitions grow after each refresh, from 
> 34 to 74 after refreshing three times.
>  
> {code:java}
> I0617 17:01:36.244355 18605 CatalogServiceCatalog.java:2225] Refreshing table 
> metadata: hive_partition.two_partition
> I0617 17:01:38.033699 18605 HdfsTable.java:995] Reloading metadata for table 
> definition and all partition(s) of hive_partition.two_partition (REFRESH 
> issued by root)
> I0617 17:01:39.245016 18605 ParallelFileMetadataLoader.java:147] Loading file 
> and block metadata for 10 paths for table hive_partition.two_partition using 
> a thread pool of size 10
> I0617 17:01:39.336242 18605 HdfsTable.java:690] Loaded file and block 
> metadata for hive_partition.two_partition partitions: day=20240613/hour=0, 
> day=20240613/hour=1, day=20240613/hour=2, and 7 others. Time taken: 91.234ms
> I0617 17:01:39.336658 18605 

[jira] [Commented] (IMPALA-13160) Impala query stuck after query from special partition 'hour=0' and 'hour=00' which hour type is int

2024-06-17 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855778#comment-17855778
 ] 

Quanlong Huang commented on IMPALA-13160:
-

I can reproduce the issue now. The key is that the partitions should be created 
by Hive. When running SHOW PARTITIONS in Hive, I can see duplicated partitions 
(e.g. hour=00 duplicates hour=0, hour=01 duplicates hour=1):
{noformat}
+---+
|   partition   |
+---+
| day=20240613/hour=0   |
| day=20240613/hour=00  |
| day=20240613/hour=01  |
| day=20240613/hour=02  |
| day=20240613/hour=03  |
| day=20240613/hour=04  |
| day=20240613/hour=05  |
| day=20240613/hour=06  |
| day=20240613/hour=07  |
| day=20240613/hour=08  |
| day=20240613/hour=09  |
| day=20240613/hour=1   |
| day=20240613/hour=10  |
| day=20240613/hour=11  |
| day=20240613/hour=12  |
| day=20240613/hour=13  |
| day=20240613/hour=14  |
| day=20240613/hour=15  |
| day=20240613/hour=16  |
| day=20240613/hour=17  |
| day=20240613/hour=18  |
| day=20240613/hour=19  |
| day=20240613/hour=2   |
| day=20240613/hour=20  |
| day=20240613/hour=21  |
| day=20240613/hour=22  |
| day=20240613/hour=23  |
| day=20240613/hour=3   |
| day=20240613/hour=4   |
| day=20240613/hour=5   |
| day=20240613/hour=6   |
| day=20240613/hour=7   |
| day=20240613/hour=8   |
| day=20240613/hour=9   |
+---+
34 rows selected (0.103 seconds){noformat}
However, partitions are not referenced correctly in queries. E.g. inserting a 
row into hour=00 actually inserts into hour=0:
{code:sql}
hive> insert into hive_partition.two_partition partition(day=20240613, hour=00) 
select 1, 'name';
{code}
The file is created as 
'hdfs://localhost:20500/test-warehouse/hive_partition.db/two_partition/day=20240613/hour=0/00_0'
which is under the partition dir of hour=0.

Using local-catalog mode in Impala can fix the hanging issue. However, query 
results could be unexpected. This seems to be a gray area of both Hive and 
Impala.
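
The duplication comes from value normalization: with an int-typed partition
column, 'hour=0' and 'hour=00' are distinct directory (and HMS partition)
names but the same partition value. A minimal illustration (not Impala code):
{code:java}
import java.util.HashMap;
import java.util.Map;

public class PartitionKeyCollision {
  public static void main(String[] args) {
    String[] dirs = {"hour=0", "hour=00", "hour=01", "hour=1"};
    // Map keyed by the *typed* partition value, as an int column implies.
    Map<Integer, String> byValue = new HashMap<>();
    for (String d : dirs) {
      int hour = Integer.parseInt(d.substring(d.indexOf('=') + 1));
      String prev = byValue.put(hour, d);
      if (prev != null) {
        System.out.println(d + " collides with " + prev + " (value " + hour + ")");
      }
    }
    // Prints:
    //   hour=00 collides with hour=0 (value 0)
    //   hour=1 collides with hour=01 (value 1)
  }
}
{code}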

> Impala query stuck after query from special partition 'hour=0' and 'hour=00' 
> which hour type is int
> ---
>
> Key: IMPALA-13160
> URL: https://issues.apache.org/jira/browse/IMPALA-13160
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog, fe
>Affects Versions: Impala 3.4.0, Impala 4.3.0
>Reporter: LiuYuan
>Priority: Critical
>
> 1. When creating a table as below:
> {code:java}
>  CREATE TABLE hive_partition.two_partition (               
>    id INT,                                                 
>    name STRING                                             
>  )                                                         
>  PARTITIONED BY (                                          
>    day INT,                                                
>    hour INT                                                
>  )                                                         
>  WITH SERDEPROPERTIES ('serialization.format'='1')         
>  STORED AS ORC                                             
>  LOCATION 'hdfs://ly-pfs/hive/hive_partition/two_partition'{code}
> 2. Then create dirs as below:
>  
> {code:java}
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=0
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=00
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=01
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=02
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=03
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=04
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=05
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=06
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=07
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=08
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=09
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=1
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=10
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=11
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=12
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=13
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=14
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=15
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=16
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=17
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=18
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=19
> 

[jira] [Updated] (IMPALA-13160) Impala query stuck after query from special partition 'hour=0' and 'hour=00' which hour type is int

2024-06-17 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13160:

Priority: Critical  (was: Major)

> Impala query stuck after query from special partition 'hour=0' and 'hour=00' 
> which hour type is int
> ---
>
> Key: IMPALA-13160
> URL: https://issues.apache.org/jira/browse/IMPALA-13160
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog, fe
>Affects Versions: Impala 3.4.0, Impala 4.3.0
>Reporter: LiuYuan
>Priority: Critical
>
> 1. When creating a table as below:
> {code:java}
>  CREATE TABLE hive_partition.two_partition (               
>    id INT,                                                 
>    name STRING                                             
>  )                                                         
>  PARTITIONED BY (                                          
>    day INT,                                                
>    hour INT                                                
>  )                                                         
>  WITH SERDEPROPERTIES ('serialization.format'='1')         
>  STORED AS ORC                                             
>  LOCATION 'hdfs://ly-pfs/hive/hive_partition/two_partition'{code}
> 2. Then create dirs as below:
>  
> {code:java}
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=0
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=00
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=01
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=02
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=03
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=04
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=05
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=06
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=07
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=08
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=09
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=1
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=10
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=11
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=12
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=13
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=14
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=15
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=16
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=17
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=18
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=19
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=2
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=20
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=21
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=22
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=23
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=3
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=4
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=5
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=6
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=7
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=8
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=9{code}
>  
> 3. Execute REFRESH hive_partition.two_partition multiple times.
> On Impala 3.4.0, the total partitions grow after each refresh, from 
> 34 to 74 after refreshing three times.
>  
> {code:java}
> I0617 17:01:36.244355 18605 CatalogServiceCatalog.java:2225] Refreshing table 
> metadata: hive_partition.two_partition
> I0617 17:01:38.033699 18605 HdfsTable.java:995] Reloading metadata for table 
> definition and all partition(s) of hive_partition.two_partition (REFRESH 
> issued by root)
> I0617 17:01:39.245016 18605 ParallelFileMetadataLoader.java:147] Loading file 
> and block metadata for 10 paths for table hive_partition.two_partition using 
> a thread pool of size 10
> I0617 17:01:39.336242 18605 HdfsTable.java:690] Loaded file and block 
> metadata for hive_partition.two_partition partitions: day=20240613/hour=0, 
> day=20240613/hour=1, day=20240613/hour=2, and 7 others. Time taken: 91.234ms
> I0617 17:01:39.336658 18605 ParallelFileMetadataLoader.java:147] Refreshing 
> file and block metadata for 34 paths for 

[jira] [Updated] (IMPALA-13161) impalad crash -- impala::DelimitedTextParser::ParseFieldLocations

2024-06-17 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13161:

Component/s: Backend
 (was: be)

> impalad crash -- impala::DelimitedTextParser::ParseFieldLocations
> ---
>
> Key: IMPALA-13161
> URL: https://issues.apache.org/jira/browse/IMPALA-13161
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 4.0.0, Impala 4.4.0
>Reporter: nyq
>Priority: Critical
>
> Impala version: 4.0.0
> Problem:
> impalad crashes when operating on a text table that has a 3GB data file 
> containing only '\x00' characters
> Steps:
> python -c 'f=open("impala_0_3gb.data.csv", "wb");tmp="\x00"*1024*1024*3; 
> [f.write(tmp) for i in range(1024)] ;f.close()'
> create table impala_0_3gb (id int)
> hdfs dfs -put impala_0_3gb.data.csv /user/hive/warehouse/impala_0_3gb/
> refresh impala_0_3gb
> select count(1) from impala_0_3gb
> Errors:
> Wrote minidump to 1dcf110f-5a2e-49a2-be4eb7a5-4709ed19.dmp
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x0181861c, pid=956182, tid=0x7fc6b340e700
> #
> # JRE version: OpenJDK Runtime Environment (8.0) (build 1.8.0)
> # Java VM: OpenJDK 64-Bit Server VM
> # Problematic frame:
> # C  [impalad+0x141861c]  
> impala::DelimitedTextParser::ParseFieldLocations(int, long, char**, 
> char**, impala::FieldLocation*, int*, int*, char**)+0x7cc
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /tmp/hs_err_pid956182.log
> #
> #
> C  [impalad+0x141861c]  
> impala::DelimitedTextParser::ParseFieldLocations(int, long, char**, 
> char**, impala::FieldLocation*, int*, int*, char**)+0x7cc
> C  [impalad+0x136fe11]  
> impala::HdfsTextScanner::ProcessRange(impala::RowBatch*, int*)+0x1a1
> C  [impalad+0x137100e]  
> impala::HdfsTextScanner::FinishScanRange(impala::RowBatch*)+0x3be
> C  [impalad+0x13721ac]  
> impala::HdfsTextScanner::GetNextInternal(impala::RowBatch*)+0x12c
> C  [impalad+0x131cdfc]  impala::HdfsScanner::ProcessSplit()+0x19c
> C  [impalad+0x1443e17]  
> impala::HdfsScanNode::ProcessSplit(std::vector<impala::FilterContext, 
> std::allocator<impala::FilterContext> > const&, impala::MemPool*, 
> impala::io::ScanRange*, long*)+0x7e7
> C  [impalad+0x1447001]  impala::HdfsScanNode::ScannerThread(bool, long)+0x541



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-13161) impalad crash -- impala::DelimitedTextParser::ParseFieldLocations

2024-06-17 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13161:

Affects Version/s: Impala 4.4.0

> impalad crash -- impala::DelimitedTextParser::ParseFieldLocations
> ---
>
> Key: IMPALA-13161
> URL: https://issues.apache.org/jira/browse/IMPALA-13161
> Project: IMPALA
>  Issue Type: Bug
>  Components: be
>Affects Versions: Impala 4.0.0, Impala 4.4.0
>Reporter: nyq
>Priority: Critical
>
> Impala version: 4.0.0
> Problem:
> impalad crashes when operating on a text table that has a 3GB data file 
> containing only '\x00' characters
> Steps:
> python -c 'f=open("impala_0_3gb.data.csv", "wb");tmp="\x00"*1024*1024*3; 
> [f.write(tmp) for i in range(1024)] ;f.close()'
> create table impala_0_3gb (id int)
> hdfs dfs -put impala_0_3gb.data.csv /user/hive/warehouse/impala_0_3gb/
> refresh impala_0_3gb
> select count(1) from impala_0_3gb
> Errors:
> Wrote minidump to 1dcf110f-5a2e-49a2-be4eb7a5-4709ed19.dmp
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x0181861c, pid=956182, tid=0x7fc6b340e700
> #
> # JRE version: OpenJDK Runtime Environment (8.0) (build 1.8.0)
> # Java VM: OpenJDK 64-Bit Server VM
> # Problematic frame:
> # C  [impalad+0x141861c]  
> impala::DelimitedTextParser::ParseFieldLocations(int, long, char**, 
> char**, impala::FieldLocation*, int*, int*, char**)+0x7cc
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /tmp/hs_err_pid956182.log
> #
> #
> C  [impalad+0x141861c]  
> impala::DelimitedTextParser::ParseFieldLocations(int, long, char**, 
> char**, impala::FieldLocation*, int*, int*, char**)+0x7cc
> C  [impalad+0x136fe11]  
> impala::HdfsTextScanner::ProcessRange(impala::RowBatch*, int*)+0x1a1
> C  [impalad+0x137100e]  
> impala::HdfsTextScanner::FinishScanRange(impala::RowBatch*)+0x3be
> C  [impalad+0x13721ac]  
> impala::HdfsTextScanner::GetNextInternal(impala::RowBatch*)+0x12c
> C  [impalad+0x131cdfc]  impala::HdfsScanner::ProcessSplit()+0x19c
> C  [impalad+0x1443e17]  
> impala::HdfsScanNode::ProcessSplit(std::vector<impala::FilterContext, 
> std::allocator<impala::FilterContext> > const&, impala::MemPool*, 
> impala::io::ScanRange*, long*)+0x7e7
> C  [impalad+0x1447001]  impala::HdfsScanNode::ScannerThread(bool, long)+0x541



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-13161) impalad crash -- impala::DelimitedTextParser::ParseFieldLocations

2024-06-17 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855619#comment-17855619
 ] 

Quanlong Huang commented on IMPALA-13161:
-

[~nyq] Thanks for reporting this! I can still reproduce the crash in the master 
branch (commit cce6b349f).
{noformat}
C  [impalad+0x1fc3283]  impala::Status 
impala::DelimitedTextParser::ParseSse(int, long*, char**, char**, 
impala::FieldLocation*, int*, int*, char**)+0x293
C  [impalad+0x1fc3991]  
impala::DelimitedTextParser::ParseFieldLocations(int, long, char**, 
char**, impala::FieldLocation*, int*, int*, char**)+0x1c9
C  [impalad+0x1c48c45]  
impala::HdfsTextScanner::ProcessRange(impala::RowBatch*, int*)+0x257
C  [impalad+0x1c4b01d]  
impala::HdfsTextScanner::FinishScanRange(impala::RowBatch*)+0x178b
C  [impalad+0x1c4b76b]  
impala::HdfsTextScanner::GetNextInternal(impala::RowBatch*)+0x457
C  [impalad+0x1725d41]  impala::HdfsScanner::ProcessSplit()+0xcf
C  [impalad+0x181b0a4]  
impala::HdfsScanNode::ProcessSplit(std::vector<impala::FilterContext, std::allocator<impala::FilterContext> > const&, impala::MemPool*, 
impala::io::ScanRange*, long*)+0xc00
C  [impalad+0x181be8a]  impala::HdfsScanNode::ScannerThread(bool, long)+0x508
C  [impalad+0x181c583]  
impala::ClientRequestState::LogAuditRecord(impala::Status const&)+0x6b3
C  [impalad+0x165525a]  impala::Thread::SuperviseThread {noformat}

> impalad crash -- impala::DelimitedTextParser::ParseFieldLocations
> ---
>
> Key: IMPALA-13161
> URL: https://issues.apache.org/jira/browse/IMPALA-13161
> Project: IMPALA
>  Issue Type: Bug
>  Components: be
>Affects Versions: Impala 4.0.0
>Reporter: nyq
>Priority: Critical
>
> Impala version: 4.0.0
> Problem:
> impalad crashes when operating on a text table that has a 3GB data file 
> containing only '\x00' characters
> Steps:
> python -c 'f=open("impala_0_3gb.data.csv", "wb");tmp="\x00"*1024*1024*3; 
> [f.write(tmp) for i in range(1024)] ;f.close()'
> create table impala_0_3gb (id int)
> hdfs dfs -put impala_0_3gb.data.csv /user/hive/warehouse/impala_0_3gb/
> refresh impala_0_3gb
> select count(1) from impala_0_3gb
> Errors:
> Wrote minidump to 1dcf110f-5a2e-49a2-be4eb7a5-4709ed19.dmp
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x0181861c, pid=956182, tid=0x7fc6b340e700
> #
> # JRE version: OpenJDK Runtime Environment (8.0) (build 1.8.0)
> # Java VM: OpenJDK 64-Bit Server VM
> # Problematic frame:
> # C  [impalad+0x141861c]  
> impala::DelimitedTextParser::ParseFieldLocations(int, long, char**, 
> char**, impala::FieldLocation*, int*, int*, char**)+0x7cc
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /tmp/hs_err_pid956182.log
> #
> #
> C  [impalad+0x141861c]  
> impala::DelimitedTextParser::ParseFieldLocations(int, long, char**, 
> char**, impala::FieldLocation*, int*, int*, char**)+0x7cc
> C  [impalad+0x136fe11]  
> impala::HdfsTextScanner::ProcessRange(impala::RowBatch*, int*)+0x1a1
> C  [impalad+0x137100e]  
> impala::HdfsTextScanner::FinishScanRange(impala::RowBatch*)+0x3be
> C  [impalad+0x13721ac]  
> impala::HdfsTextScanner::GetNextInternal(impala::RowBatch*)+0x12c
> C  [impalad+0x131cdfc]  impala::HdfsScanner::ProcessSplit()+0x19c
> C  [impalad+0x1443e17]  
> impala::HdfsScanNode::ProcessSplit(std::vector<impala::FilterContext, 
> std::allocator<impala::FilterContext> > const&, impala::MemPool*, 
> impala::io::ScanRange*, long*)+0x7e7
> C  [impalad+0x1447001]  impala::HdfsScanNode::ScannerThread(bool, long)+0x541



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-13160) Impala query stuck after query from special partition 'hour=0' and 'hour=00' which hour type is int

2024-06-17 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17855590#comment-17855590
 ] 

Quanlong Huang commented on IMPALA-13160:
-

[~liuyuan43] Thanks for reporting this! Unfortunately, I can't reproduce the 
issue using your steps. Did you run ALTER TABLE RECOVER PARTITIONS before the 
REFRESH? If we just run REFRESH after creating the table and hdfs dirs, the 
table will still have 0 partitions.

I can't reproduce the issue even after running ALTER TABLE RECOVER PARTITIONS. 
Please share the commit hash of your version. A complete version string like 
this helps:
{code:java}
impalad version 4.5.0-SNAPSHOT DEBUG (build 
cce6b349f1103c167e2e9ef49fa181ede301b94f){code}
You can find it in the WebUI.

> Impala query stuck after query from special partition 'hour=0' and 'hour=00' 
> which hour type is int
> ---
>
> Key: IMPALA-13160
> URL: https://issues.apache.org/jira/browse/IMPALA-13160
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog, fe
>Affects Versions: Impala 3.4.0, Impala 4.3.0
>Reporter: LiuYuan
>Priority: Major
>
> 1. When creating a table as below:
> {code:java}
>  CREATE TABLE hive_partition.two_partition (               
>    id INT,                                                 
>    name STRING                                             
>  )                                                         
>  PARTITIONED BY (                                          
>    day INT,                                                
>    hour INT                                                
>  )                                                         
>  WITH SERDEPROPERTIES ('serialization.format'='1')         
>  STORED AS ORC                                             
>  LOCATION 'hdfs://ly-pfs/hive/hive_partition/two_partition'{code}
> 2. Then create dirs as below:
>  
> {code:java}
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=0
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=00
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=01
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=02
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=03
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=04
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=05
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=06
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=07
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=08
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=09
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=1
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=10
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=11
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=12
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=13
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=14
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=15
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=16
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=17
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=18
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=19
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=2
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=20
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=21
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=22
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=23
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=3
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=4
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=5
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=6
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=7
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=8
> hdfs://ly-pfs/hive/hive_partition/two_partition/day=20240613/hour=9{code}
>  
> 3. Execute REFRESH hive_partition.two_partition multiple times.
> On Impala 3.4.0, the total partitions grow after each refresh, from 
> 34 to 74 after refreshing three times.
>  
> {code:java}
> I0617 17:01:36.244355 18605 CatalogServiceCatalog.java:2225] Refreshing table 
> metadata: hive_partition.two_partition
> I0617 17:01:38.033699 18605 HdfsTable.java:995] Reloading metadata for table 
> definition and all partition(s) of 

[jira] [Updated] (IMPALA-11648) validate-java-pom-versions.sh should skip pom.xml in toolchain

2024-06-16 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-11648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-11648:

Fix Version/s: Impala 3.4.2

> validate-java-pom-versions.sh should skip pom.xml in toolchain
> --
>
> Key: IMPALA-11648
> URL: https://issues.apache.org/jira/browse/IMPALA-11648
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 4.2.0
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Blocker
> Fix For: Impala 3.4.2, Impala 4.2.0, Impala 4.1.1
>
>
> Building the RC1 tarball of the 4.1.1 release failed in 
> bin/validate-java-pom-versions.sh:
> {noformat}
> Check for Java pom.xml versions FAILED
> Expected 4.1.1-RELEASE
> Not found in:
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/accumulo-handler/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/beeline/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/classification/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/cli/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/common/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/contrib/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/druid-handler/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/hbase-handler/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/hcatalog/core/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/hcatalog/hcatalog-pig-adapter/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/hcatalog/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/hcatalog/server-extensions/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/hcatalog/streaming/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/hcatalog/webhcat/java-client/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/hcatalog/webhcat/svr/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/hplsql/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/impala/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/itests/catalogd-unit/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/itests/custom-serde/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/itests/custom-udfs/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/itests/custom-udfs/udf-classloader-udf1/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/itests/custom-udfs/udf-classloader-udf2/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/itests/custom-udfs/udf-classloader-util/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/itests/custom-udfs/udf-vectorized-badexample/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/itests/hcatalog-unit/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/itests/hive-blobstore/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/itests/hive-jmh/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/itests/hive-minikdc/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/itests/hive-unit-hadoop2/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/itests/hive-unit/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/itests/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/itests/qtest-accumulo/pom.xml
>   
> /root/apache-impala-4.1.1/toolchain/cdp_components-23144489/hive-3.1.3000.7.2.15.0-88/itests/qtest-druid/pom.xml
>   
> 

[jira] [Assigned] (IMPALA-13077) Equality predicate on partition column and uncorrelated subquery doesn't reduce the cardinality estimate

2024-06-13 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang reassigned IMPALA-13077:
---

Assignee: (was: Quanlong Huang)

> Equality predicate on partition column and uncorrelated subquery doesn't 
> reduce the cardinality estimate
> 
>
> Key: IMPALA-13077
> URL: https://issues.apache.org/jira/browse/IMPALA-13077
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Reporter: Quanlong Huang
>Priority: Critical
>
> Let's say 'part_tbl' is a partitioned table. Its partition key is 'part_key'. 
> Consider the following query:
> {code:sql}
> select xxx from part_tbl
> where part_key=(select ... from dim_tbl);
> {code}
> Its query plan is a JoinNode with two ScanNodes. When estimating the 
> cardinality of the JoinNode, the planner is not aware that 'part_key' is the 
> partition column and the cardinality of the JoinNode should not be larger 
> than the max row count across partitions.
> The recent work in IMPALA-12018 (Consider runtime filter for cardinality 
> reduction) helps in some cases since there are runtime filters on the 
> partition column. But there are still some cases where we overestimate the 
> cardinality. For instance, 'ss_sold_date_sk' is the only partition key of 
> tpcds.store_sales. The following query
> {code:sql}
> select count(*) from tpcds.store_sales
> where ss_sold_date_sk=(
>   select min(d_date_sk) + 1000 from tpcds.date_dim);{code}
> has query plan:
> {noformat}
> +-+
> | Explain String  |
> +-+
> | Max Per-Host Resource Reservation: Memory=18.94MB Threads=6 |
> | Per-Host Resource Estimates: Memory=243MB   |
> | |
> | PLAN-ROOT SINK  |
> | |   |
> | 09:AGGREGATE [FINALIZE] |
> | |  output: count:merge(*)   |
> | |  row-size=8B cardinality=1|
> | |   |
> | 08:EXCHANGE [UNPARTITIONED] |
> | |   |
> | 04:AGGREGATE|
> | |  output: count(*) |
> | |  row-size=8B cardinality=1|
> | |   |
> | 03:HASH JOIN [LEFT SEMI JOIN, BROADCAST]|
> | |  hash predicates: ss_sold_date_sk = min(d_date_sk) + 1000 |
> | |  runtime filters: RF000 <- min(d_date_sk) + 1000  |
> | |  row-size=4B cardinality=2.88M <-- Should be max(numRows) across 
> partitions
> | |   |
> | |--07:EXCHANGE [BROADCAST]  |
> | |  ||
> | |  06:AGGREGATE [FINALIZE]  |
> | |  |  output: min:merge(d_date_sk)  |
> | |  |  row-size=4B cardinality=1 |
> | |  ||
> | |  05:EXCHANGE [UNPARTITIONED]  |
> | |  ||
> | |  02:AGGREGATE |
> | |  |  output: min(d_date_sk)|
> | |  |  row-size=4B cardinality=1 |
> | |  ||
> | |  01:SCAN HDFS [tpcds.date_dim]|
> | | HDFS partitions=1/1 files=1 size=9.84MB   |
> | | row-size=4B cardinality=73.05K|
> | |   |
> | 00:SCAN HDFS [tpcds.store_sales]|
> |HDFS partitions=1824/1824 files=1824 size=346.60MB   |
> |runtime filters: RF000 -> ss_sold_date_sk|
> |row-size=4B cardinality=2.88M|
> +-+{noformat}
> CC [~boroknagyz], [~rizaon]
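>
> One possible fix direction (a sketch only; names are illustrative, not the
> planner's actual API): when the equality predicate is on a partition column,
> cap the join cardinality at the maximum row count across partitions:
> {code:java}
> // Hypothetical helper: returns the capped cardinality estimate.
> // maxRowsAcrossPartitions < 0 means the statistic is unknown.
> static long capJoinCardinality(long joinCardinality, boolean eqOnPartitionKey,
>     long maxRowsAcrossPartitions) {
>   if (!eqOnPartitionKey || maxRowsAcrossPartitions < 0) return joinCardinality;
>   return Math.min(joinCardinality, maxRowsAcrossPartitions);
> }
> {code}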



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Created] (IMPALA-13154) Some tables are missing in Top-N Tables with Highest Memory Requirements

2024-06-12 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13154:
---

 Summary: Some tables are missing in Top-N Tables with Highest 
Memory Requirements
 Key: IMPALA-13154
 URL: https://issues.apache.org/jira/browse/IMPALA-13154
 Project: IMPALA
  Issue Type: Bug
  Components: Catalog
Reporter: Quanlong Huang


In the /catalog page of catalogd WebUI, there is a table for "Top-N Tables with 
Highest Memory Requirements". However, not all tables are counted there. E.g. 
after starting catalogd, run a DESCRIBE on a table to trigger metadata loading 
on it. When it's done, the table is not shown in the WebUI.

The cause is that the list is only updated in HdfsTable.getTHdfsTable() when 
'type' is 
ThriftObjectType.FULL:
[https://github.com/apache/impala/blob/ee21427d26620b40d38c706b4944d2831f84f6f5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L2457-L2459]

This used to be a place that all code paths using the table would go through. 
However, we've done a bunch of optimizations that avoid getting the FULL thrift 
object of the table. We should move the code that updates the list of largest 
tables somewhere that all table usages can reach, e.g. we can update the 
table's estimatedMetadataSize right after loading its metadata.
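
A self-contained sketch of the idea (names and the hook point are hypothetical; 
Impala's actual tracker lives in catalogd):
{code:java}
import java.util.Comparator;
import java.util.PriorityQueue;

public class TopNTableSizes {
  static final int N = 25;
  static class Entry {
    final String table; final long sizeBytes;
    Entry(String table, long sizeBytes) { this.table = table; this.sizeBytes = sizeBytes; }
  }
  // Min-heap: the smallest tracked entry is evicted first. For simplicity this
  // sketch does not dedupe a table whose size is reported twice.
  private final PriorityQueue<Entry> heap =
      new PriorityQueue<>(Comparator.comparingLong((Entry e) -> e.sizeBytes));

  // Call this right after a table's metadata load completes, so every loaded
  // table is considered, not only those serialized with ThriftObjectType.FULL.
  synchronized void onTableLoaded(String table, long estimatedMetadataSize) {
    heap.add(new Entry(table, estimatedMetadataSize));
    if (heap.size() > N) heap.poll();  // drop the smallest
  }
}
{code}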



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org





[jira] [Commented] (IMPALA-13152) IllegalStateException in computing processing cost when there are predicates on analytic output columns

2024-06-11 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853924#comment-17853924
 ] 

Quanlong Huang commented on IMPALA-13152:
-

Assigning this to [~rizaon], who knows more about this.

> IllegalStateException in computing processing cost when there are predicates 
> on analytic output columns
> ---
>
> Key: IMPALA-13152
> URL: https://issues.apache.org/jira/browse/IMPALA-13152
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Reporter: Quanlong Huang
>Assignee: Riza Suminto
>Priority: Major
>
> Saw an error in the following query when COMPUTE_PROCESSING_COST is on:
> {code:sql}
> create table tbl (a int, b int, c int);
> set COMPUTE_PROCESSING_COST=1;
> explain select a, b from (
>   select a, b, c,
> row_number() over(partition by a order by b desc) as latest
>   from tbl
> )b
> WHERE latest=1
> ERROR: IllegalStateException: Processing cost of PlanNode 01:TOP-N is invalid!
> {code}
> Exception in the logs:
> {noformat}
> I0611 13:04:37.192874 28004 jni-util.cc:321] 
> 264ee79bfb6ac031:42f8006c] java.lang.IllegalStateException: 
> Processing cost of PlanNode 01:TOP-N is invalid!
> at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:512)
> at 
> org.apache.impala.planner.PlanNode.computeRowConsumptionAndProductionToCost(PlanNode.java:1047)
> at 
> org.apache.impala.planner.PlanFragment.computeCostingSegment(PlanFragment.java:287)
> at 
> org.apache.impala.planner.Planner.computeProcessingCost(Planner.java:560)
> at 
> org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1932)
> at 
> org.apache.impala.service.Frontend.getPlannedExecRequest(Frontend.java:2892)
> at 
> org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2676)
> at 
> org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:2224)
> at 
> org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1985)
> at 
> org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:175){noformat}
> The error doesn't occur if the predicate "latest=1" is removed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Created] (IMPALA-13152) IllegalStateException in computing processing cost when there are predicates on analytic output columns

2024-06-10 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13152:
---

 Summary: IllegalStateException in computing processing cost when 
there are predicates on analytic output columns
 Key: IMPALA-13152
 URL: https://issues.apache.org/jira/browse/IMPALA-13152
 Project: IMPALA
  Issue Type: Bug
  Components: Frontend
Reporter: Quanlong Huang
Assignee: Riza Suminto


Saw an error in the following query when COMPUTE_PROCESSING_COST is on:
{code:sql}
create table tbl (a int, b int, c int);

set COMPUTE_PROCESSING_COST=1;

explain select a, b from (
  select a, b, c,
row_number() over(partition by a order by b desc) as latest
  from tbl
)b
WHERE latest=1

ERROR: IllegalStateException: Processing cost of PlanNode 01:TOP-N is invalid!
{code}
Exception in the logs:
{noformat}
I0611 13:04:37.192874 28004 jni-util.cc:321] 264ee79bfb6ac031:42f8006c] 
java.lang.IllegalStateException: Processing cost of PlanNode 01:TOP-N is 
invalid!
at 
com.google.common.base.Preconditions.checkState(Preconditions.java:512)
at 
org.apache.impala.planner.PlanNode.computeRowConsumptionAndProductionToCost(PlanNode.java:1047)
at 
org.apache.impala.planner.PlanFragment.computeCostingSegment(PlanFragment.java:287)
at 
org.apache.impala.planner.Planner.computeProcessingCost(Planner.java:560)
at 
org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1932)
at 
org.apache.impala.service.Frontend.getPlannedExecRequest(Frontend.java:2892)
at 
org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2676)
at 
org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:2224)
at 
org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1985)
at 
org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:175){noformat}
The error doesn't occur if the predicate "latest=1" is removed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-13093) Insert into Huawei OBS table failed

2024-06-10 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853843#comment-17853843
 ] 

Quanlong Huang commented on IMPALA-13093:
-

It seems adding this to hdfs-site.xml can also fix the issue:
{code:xml}
<property>
  <name>fs.obs.file.visibility.enable</name>
  <value>true</value>
</property>
{code}
I'll check whether OBS returns the real block size.
CC [~michaelsmith] [~eyizoha]

> Insert into Huawei OBS table failed
> ---
>
> Key: IMPALA-13093
> URL: https://issues.apache.org/jira/browse/IMPALA-13093
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 4.3.0
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
>
> Insert into a table using Huawei OBS (Object Storage Service) as the storage 
> fails with the following error:
> {noformat}
> Query: insert into test_obs1 values (1, 'abc')
> ERROR: Failed to get info on temporary HDFS file: 
> obs://obs-test-ee93/input/test_obs1/_impala_insert_staging/fe4ac1be6462a13f_362a9b5b/.fe4ac1be6462a13f-362a9b5b_1213692075_dir//fe4ac1be6462a13f-362a9b5b_375832652_data.0.txt
> Error(2): No such file or directory {noformat}
> Looking into the logs:
> {noformat}
> I0516 16:40:55.663640 18922 status.cc:129] fe4ac1be6462a13f:362a9b5b] 
> Failed to get info on temporary HDFS file: 
> obs://obs-test-ee93/input/test_obs1/_impala_insert_staging/fe4ac1be6462a13f_362a9b5b/.fe4ac1be6462a13f-362a9b5b_1213692075_dir//fe4ac1be6462a13f-362a9b5b_375832652_data.0.txt
> Error(2): No such file or directory
> @   0xfc6d44  impala::Status::Status()
> @  0x1c42020  impala::HdfsTableSink::CreateNewTmpFile()
> @  0x1c44357  impala::HdfsTableSink::InitOutputPartition()
> @  0x1c4988a  impala::HdfsTableSink::GetOutputPartition()
> @  0x1c46569  impala::HdfsTableSink::Send()
> @  0x14ee25f  impala::FragmentInstanceState::ExecInternal()
> @  0x14efca3  impala::FragmentInstanceState::Exec()
> @  0x148dc4c  impala::QueryState::ExecFInstance()
> @  0x1b3bab9  impala::Thread::SuperviseThread()
> @  0x1b3cdb1  boost::detail::thread_data<>::run()
> @  0x2474a87  thread_proxy
> @ 0x7fe5a562dea5  start_thread
> @ 0x7fe5a25ddb0d  __clone{noformat}
> Note that impalad is started with {{--symbolize_stacktrace=true}} so the 
> stacktrace has symbols.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Created] (IMPALA-13149) Show JVM info in the WebUI

2024-06-09 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13149:
---

 Summary: Show JVM info in the WebUI
 Key: IMPALA-13149
 URL: https://issues.apache.org/jira/browse/IMPALA-13149
 Project: IMPALA
  Issue Type: New Feature
Reporter: Quanlong Huang


It'd be helpful to show the JVM info in the WebUI, e.g. show the output of 
"java -version":
{code:java}
openjdk version "1.8.0_412"
OpenJDK Runtime Environment (build 1.8.0_412-b08)
OpenJDK 64-Bit Server VM (build 25.412-b08, mixed mode){code}
On nodes that only have a JRE deployed, we'd like to deploy the same version of 
the JDK to perform heap dumps (jmap), so showing the JVM info in the WebUI 
helps identify which version to install.
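For reference, the JVM can report this about itself via standard system 
properties, so a WebUI handler could render it without shelling out to 
"java -version" (a sketch, not the actual patch):
{code:java}
// Prints roughly the same information as "java -version", using standard
// system properties that are available in any JVM.
public class JvmInfo {
  public static void main(String[] args) {
    System.out.println("java version \"" + System.getProperty("java.version") + "\"");
    System.out.println(System.getProperty("java.runtime.name") + " (build "
        + System.getProperty("java.runtime.version") + ")");
    System.out.println(System.getProperty("java.vm.name") + " (build "
        + System.getProperty("java.vm.version") + ", "
        + System.getProperty("java.vm.info") + ")");
  }
}
{code}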



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IMPALA-13148) Show the number of in-progress Catalog operations

2024-06-09 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13148:

Attachment: Selection_123.png
Selection_122.png

> Show the number of in-progress Catalog operations
> -
>
> Key: IMPALA-13148
> URL: https://issues.apache.org/jira/browse/IMPALA-13148
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Quanlong Huang
>Priority: Major
>  Labels: newbie, ramp-up
> Attachments: Selection_122.png, Selection_123.png
>
>
> In the /operations page of the catalogd WebUI, the list of In-progress Catalog 
> Operations is shown. It'd be helpful to also show the number of such 
> operations, like the /queries page of the coordinator WebUI, which shows e.g. 
> "100 queries in flight".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Created] (IMPALA-13148) Show the number of in-progress Catalog operations

2024-06-09 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13148:
---

 Summary: Show the number of in-progress Catalog operations
 Key: IMPALA-13148
 URL: https://issues.apache.org/jira/browse/IMPALA-13148
 Project: IMPALA
  Issue Type: Improvement
Reporter: Quanlong Huang
 Attachments: Selection_122.png, Selection_123.png

In the /operations page of the catalogd WebUI, the list of In-progress Catalog 
Operations is shown. It'd be helpful to also show the number of such 
operations, like the /queries page of the coordinator WebUI, which shows e.g. 
"100 queries in flight".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org




[jira] [Created] (IMPALA-13126) ReloadEvent.isOlderEvent() should hold the table read lock

2024-06-03 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13126:
---

 Summary: ReloadEvent.isOlderEvent() should hold the table read lock
 Key: IMPALA-13126
 URL: https://issues.apache.org/jira/browse/IMPALA-13126
 Project: IMPALA
  Issue Type: Bug
  Components: Catalog
Reporter: Quanlong Huang
Assignee: Sai Hemanth Gantasala


Saw an exception like this:
{noformat}
E0601 09:11:25.275251   246 MetastoreEventsProcessor.java:990] Unexpected 
exception received while processing event
Java exception follows:
java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextNode(HashMap.java:1469)
at java.util.HashMap$ValueIterator.next(HashMap.java:1498)
at 
org.apache.impala.catalog.FeFsTable$Utils.getPartitionFromThriftPartitionSpec(FeFsTable.java:616)
at 
org.apache.impala.catalog.HdfsTable.getPartitionFromThriftPartitionSpec(HdfsTable.java:597)
at org.apache.impala.catalog.Catalog.getHdfsPartition(Catalog.java:511)
at org.apache.impala.catalog.Catalog.getHdfsPartition(Catalog.java:489)
at 
org.apache.impala.catalog.CatalogServiceCatalog.isPartitionLoadedAfterEvent(CatalogServiceCatalog.java:4024)
at 
org.apache.impala.catalog.events.MetastoreEvents$ReloadEvent.isOlderEvent(MetastoreEvents.java:2754)
at 
org.apache.impala.catalog.events.MetastoreEvents$ReloadEvent.processTableEvent(MetastoreEvents.java:2729)
at 
org.apache.impala.catalog.events.MetastoreEvents$MetastoreTableEvent.process(MetastoreEvents.java:1107)
at 
org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.processIfEnabled(MetastoreEvents.java:531)
at 
org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:1164)
at 
org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:972)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750) {noformat}
For a partition-level RELOAD event, ReloadEvent.isOlderEvent() needs to check 
whether the corresponding partition was reloaded after the event. This check 
should be done while holding the table read lock. Otherwise, EventProcessor can 
hit the error above when there are concurrent DDLs/DMLs modifying the partition 
list.

CC [~VenuReddy]
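A hedged sketch of the locking shape (the standard ReadWriteLock idiom; 
Impala's actual lock accessors and the partition check are stand-ins):
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// The partition check must run under the table's read lock so that a
// concurrent DDL/DML can't modify the partition map mid-iteration.
public class ReloadEventCheck {
  private final ReentrantReadWriteLock tableLock_ = new ReentrantReadWriteLock();

  public boolean isOlderEvent(long eventTimeMs, long lastPartitionReloadMs) {
    tableLock_.readLock().lock();
    try {
      // Stand-in for isPartitionLoadedAfterEvent(): the real check iterates
      // the table's partition map, which is what made it race-prone.
      return lastPartitionReloadMs >= eventTimeMs;
    } finally {
      tableLock_.readLock().unlock();
    }
  }
}
{code}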



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org




[jira] [Created] (IMPALA-13122) Show file stats in table loading logs

2024-06-02 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13122:
---

 Summary: Show file stats in table loading logs
 Key: IMPALA-13122
 URL: https://issues.apache.org/jira/browse/IMPALA-13122
 Project: IMPALA
  Issue Type: Improvement
  Components: Catalog
Reporter: Quanlong Huang


Here is an example of the table loading logs for a table:
{noformat}
I0603 08:46:05.67 24417 HdfsTable.java:1255] Loading metadata for table 
definition and all partition(s) of tpcds.store_sales (needed by coordinator)
I0603 08:46:05.642702 24417 HdfsTable.java:1896] Loaded 23 columns from HMS. 
Actual columns: 23
I0603 08:46:05.767457 24417 HdfsTable.java:3114] Load Valid Write Id List Done. 
Time taken: 26.699us
I0603 08:46:05.767549 24417 HdfsTable.java:1297] Fetching partition metadata 
from the Metastore: tpcds.store_sales
I0603 08:46:05.806337 24417 MetaStoreUtil.java:190] Fetching 1824 partitions 
for: tpcds.store_sales using partition batch size: 1000 
I0603 08:46:07.336064 24417 MetaStoreUtil.java:208] Fetched 1000/1824 
partitions for table tpcds.store_sales
I0603 08:46:07.915474 24417 MetaStoreUtil.java:208] Fetched 1824/1824 
partitions for table tpcds.store_sales
I0603 08:46:07.915519 24417 HdfsTable.java:1304] Fetched partition metadata 
from the Metastore: tpcds.store_sales
I0603 08:46:08.840034 24417 ParallelFileMetadataLoader.java:224] Loading file 
and block metadata for 1824 paths for table tpcds.store_sales using a thread 
pool of size 5
I0603 08:46:09.383904 24417 HdfsTable.java:836] Loaded file and block metadata 
for tpcds.store_sales partitions: ss_sold_date_sk=2450816, 
ss_sold_date_sk=2450817, ss_sold_date_sk=2450818, and 1821 others. Time taken: 
569.107ms
I0603 08:46:09.420702 24417 Table.java:1117] last refreshed event id for table: 
tpcds.store_sales set to: -1
I0603 08:46:09.420794 24417 TableLoader.java:177] Loaded metadata for: 
tpcds.store_sales (4026ms){noformat}
From the logs, we know the table has 23 columns and 1824 partitions. The time 
spent in loading the table schema and file metadata is also shown.

However, it's unknown whether there is a small-files issue under the partitions. 
The underlying storage could also be slow (e.g. S3), which results in a long 
time loading file metadata.

It'd be helpful to add these in the logs:
 * number of files loaded
 * min/avg/max of file sizes
 * total file size
 * number of files
 * number of blocks (HDFS only)
 * number of hosts, disks (HDFS/Ozone only)
 * Stats of accessTime and lastModifiedTime

These can be aggregated in FileMetadataLoader#loadInternal() and logged in 
ParallelFileMetadataLoader#load() or HdfsTable#loadFileMetadataForPartitions().

[https://github.com/apache/impala/blob/9011b81afa33ef7e4b0ec8a367b2713be8917213/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java#L177]

[https://github.com/apache/impala/blob/9011b81afa33ef7e4b0ec8a367b2713be8917213/fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java#L172]

[https://github.com/apache/impala/blob/ee21427d26620b40d38c706b4944d2831f84f6f5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L836]
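A minimal sketch of such an aggregator (the class and where it plugs in are 
illustrative, not the actual patch):
{code:java}
// Accumulates file stats during one table load: updated once per file in the
// loader, formatted once into the "Loaded file and block metadata" log line.
public class FileStatsAggregator {
  private long numFiles_ = 0;
  private long totalBytes_ = 0;
  private long minBytes_ = Long.MAX_VALUE;
  private long maxBytes_ = 0;

  public void addFile(long sizeBytes) {
    numFiles_++;
    totalBytes_ += sizeBytes;
    minBytes_ = Math.min(minBytes_, sizeBytes);
    maxBytes_ = Math.max(maxBytes_, sizeBytes);
  }

  @Override
  public String toString() {
    if (numFiles_ == 0) return "0 files";
    return String.format("%d files, total size %d, min/avg/max file size %d/%d/%d",
        numFiles_, totalBytes_, minBytes_, totalBytes_ / numFiles_, maxBytes_);
  }
}
{code}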



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org




[jira] [Created] (IMPALA-13117) Improve the heap usage during metadata loading and DDL/DML executions

2024-05-30 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13117:
---

 Summary: Improve the heap usage during metadata loading and 
DDL/DML executions
 Key: IMPALA-13117
 URL: https://issues.apache.org/jira/browse/IMPALA-13117
 Project: IMPALA
  Issue Type: Improvement
  Components: Catalog
Reporter: Quanlong Huang
Assignee: Quanlong Huang


The JVM heap of catalogd is not used just by the metadata cache. In-progress 
metadata loading threads and DDL/DML executions also create temporary objects, 
which introduces spikes in heap usage. We should improve the heap usage in this 
part, especially when metadata loading is slow due to external slowness (e.g. 
listing files on S3).

CC [~mylogi...@gmail.com] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Assigned] (IMPALA-13116) In local-catalog mode, abort REFRESH and metadata reloading of DDL/DMLs if the table is invalidated

2024-05-30 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang reassigned IMPALA-13116:
---

Assignee: Quanlong Huang

> In local-catalog mode, abort REFRESH and metadata reloading of DDL/DMLs if 
> the table is invalidated
> ---
>
> Key: IMPALA-13116
> URL: https://issues.apache.org/jira/browse/IMPALA-13116
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
>
> A table can be invalidated when there are DDL/DML/REFRESHs running in flight:
>  * A user can explicitly trigger an INVALIDATE METADATA command
>  * The table could be invalidated by CatalogdTableInvalidator when 
> invalidate_tables_on_memory_pressure or invalidate_tables_timeout_s is turned 
> on
> Note that invalidating a table doesn't require holding the lock of the 
> HdfsTable object so it can finish even if there are on-going updates on the 
> table.
> The updated HdfsTable object won't be added to the metadata cache since it 
> has been replaced with an IncompleteTable object. It's only used in the 
> DDL/DML/REFRESH responses. In local catalog mode, the response is the minimal 
> representation which is mostly the table name and catalog version. We don't 
> need the updates on the HdfsTable object to be finished. Thus, we can 
> consider aborting the reloading of such DDL/DML/REFRESH requests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org




[jira] [Created] (IMPALA-13116) In local-catalog mode, abort REFRESH and metadata reloading of DDL/DMLs if the table is invalidated

2024-05-30 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13116:
---

 Summary: In local-catalog mode, abort REFRESH and metadata 
reloading of DDL/DMLs if the table is invalidated
 Key: IMPALA-13116
 URL: https://issues.apache.org/jira/browse/IMPALA-13116
 Project: IMPALA
  Issue Type: Improvement
  Components: Catalog
Reporter: Quanlong Huang


A table can be invalidated when there are DDL/DML/REFRESHs running in flight:
 * A user can explicitly trigger an INVALIDATE METADATA command
 * The table could be invalidated by CatalogdTableInvalidator when 
invalidate_tables_on_memory_pressure or invalidate_tables_timeout_s is turned on

Note that invalidating a table doesn't require holding the lock of the 
HdfsTable object so it can finish even if there are on-going updates on the 
table.

The updated HdfsTable object won't be added to the metadata cache since it has 
been replaced with an IncompleteTable object. It's only used in the 
DDL/DML/REFRESH responses. In local catalog mode, the response is the minimal 
representation which is mostly the table name and catalog version. We don't 
need the updates on the HdfsTable object to be finished. Thus, we can consider 
aborting the reloading of such DDL/DML/REFRESH requests.
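A hedged sketch of the abort check, with hypothetical names standing in for 
Impala's catalog map and table objects:
{code:java}
import java.util.concurrent.ConcurrentHashMap;

// The reload path holds a reference to the exact HdfsTable object it is
// updating. If the catalog map no longer holds that object (an invalidate
// swapped in an IncompleteTable), finishing the reload is wasted work and
// the request can be aborted early.
public class ReloadGuard {
  private final ConcurrentHashMap<String, Object> tablesByName_ =
      new ConcurrentHashMap<>();

  public void checkNotInvalidated(String tableName, Object tableBeingReloaded) {
    if (tablesByName_.get(tableName) != tableBeingReloaded) {
      throw new IllegalStateException(
          "Table " + tableName + " was invalidated during reload; aborting");
    }
  }
}
{code}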



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org




[jira] [Created] (IMPALA-13115) Always add the query id in the error message to clients

2024-05-29 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13115:
---

 Summary: Always add the query id in the error message to clients
 Key: IMPALA-13115
 URL: https://issues.apache.org/jira/browse/IMPALA-13115
 Project: IMPALA
  Issue Type: Improvement
  Components: Backend
Reporter: Quanlong Huang


We have some errors like "Failed due to unreachable impalad(s)". We should 
improve them to mention the query id, e.g. "Query ${query_id} failed due to 
unreachable impalad(s)". In a busy cluster, queries are flushed out of the 
/queries page quickly, and coordinator logs also rotate quickly, so it's hard 
to find the query id there.
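The change itself is mechanical; a sketch of the intended shape (illustrative 
names, not Impala's actual error-reporting API):
{code:java}
// Prefix every client-facing error with the query id so users can correlate
// it with the /queries page or coordinator logs later.
public class ErrorMessages {
  static String withQueryId(String queryId, String msg) {
    return "Query " + queryId + " " + msg;
  }

  public static void main(String[] args) {
    // Prints: Query 264ee79bfb6ac031:42f8006c failed due to unreachable impalad(s)
    System.out.println(withQueryId("264ee79bfb6ac031:42f8006c",
        "failed due to unreachable impalad(s)"));
  }
}
{code}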



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org




[jira] [Assigned] (IMPALA-12834) Add query load information to the query profile

2024-05-27 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang reassigned IMPALA-12834:
---

Assignee: YifanZhang

> Add query load information to the query profile
> ---
>
> Key: IMPALA-12834
> URL: https://issues.apache.org/jira/browse/IMPALA-12834
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Perf Investigation
>Reporter: YifanZhang
>Assignee: YifanZhang
>Priority: Minor
> Fix For: Impala 4.4.0
>
>
> Add query load information to the query profile to track whether a 
> performance regression is related to insufficient resources on the node, and 
> also to help determine whether the current pool or host configurations are 
> optimal.
> The load information should include:
>  * Number of running queries of the executor group on which the query is 
> scheduled
>  * Number of running fragment instances of the hosts on which the query is 
> scheduled
>  * Used/Reserved memory of the hosts on which the query is scheduled
>  * Some other useful metrics



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-12182) Add CPU utilization time series graph for RuntimeProfile's sampled values

2024-05-27 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-12182:

Fix Version/s: Impala 4.3.0

> Add CPU utilization time series graph for RuntimeProfile's sampled values
> -
>
> Key: IMPALA-12182
> URL: https://issues.apache.org/jira/browse/IMPALA-12182
> Project: IMPALA
>  Issue Type: New Feature
>Reporter: Surya Hebbar
>Assignee: Surya Hebbar
>Priority: Major
> Fix For: Impala 4.3.0
>
> Attachments: 23-07-10_T15_33_44.png, 23-07-10_T15_36_26.png, 
> 23-07-10_T15_39_01.png, 23-07-10_T15_39_31.png, 23-07-10_T15_40_42.png, 
> 23-07-10_T15_40_50.png, 23-07-10_T15_40_55.png, cpu_utilization.png, 
> cpu_utilization_test-1.png, cpu_utilization_test-2.png, query_timeline.mkv, 
> simplescreenrecorder-2023-07-10_21.10.58.mkv, 
> simplescreenrecorder-2023-07-10_22.10.18.mkv, three_nodes.png, 
> three_nodes_zoomed_out.png, timeseries_cpu_utilization_line_plot.mkv, 
> two_nodes.png
>
>
> The RuntimeProfile contains samples of CPU utilization metrics for user, sys 
> and iowait clamped to 64 values (retrieved from the ChunkedTimeSeriesCounter, 
> but sampled similarly to SamplingTimeSeriesCounter). 
> It would be helpful to see the recent aggregate CPU node utilization samples 
> for each of the different nodes.
> These are sampled every `periodic_counter_update_period_ms`.
> AggregatedRuntimeProfile used in the Thrift profile contains the complete 
> series of values from the ChunkedTimeSeriesCounter samples. But, as this 
> representation is difficult to provide in the JSON, they have been 
> downsampled to 64 values.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-12364) Display disk and network metrics in webUI's query timeline

2024-05-27 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-12364:

Fix Version/s: Impala 4.4.0

> Display disk and network metrics in webUI's query timeline
> --
>
> Key: IMPALA-12364
> URL: https://issues.apache.org/jira/browse/IMPALA-12364
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Surya Hebbar
>Assignee: Surya Hebbar
>Priority: Major
> Fix For: Impala 4.4.0
>
> Attachments: average_disk_network_metrics.mkv, 
> averaged_disk_network_metrics.png, both_charts_resize.mkv, 
> both_charts_resize.png, close_cpu_utilization_button.mkv, 
> draggable_resize_handle.png, hor_zoom_buttons.png, 
> horizontal_zoom_buttons.mkv, host_utilization_chart_resize.mkv, 
> host_utilization_close_button.png, host_utilization_resize_bar.png, 
> multiple_fragment_metrics.png, resize_drag_handle.mkv
>
>
> It would be helpful to display disk and network usage in human readable form 
> on the query timeline, aligning it along with the CPU utilization plot, below 
> the fragment timing diagram.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-11915) Support timeline and graphical plan exports in the webUI

2024-05-27 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-11915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-11915:

Fix Version/s: Impala 4.3.0

> Support timeline and graphical plan exports in the webUI
> 
>
> Key: IMPALA-11915
> URL: https://issues.apache.org/jira/browse/IMPALA-11915
> Project: IMPALA
>  Issue Type: New Feature
>Reporter: Quanlong Huang
>Assignee: Surya Hebbar
>Priority: Major
>  Labels: supportability
> Fix For: Impala 4.3.0
>
> Attachments: export_button.png, export_modal.png, 
> export_plan_example_70b4ecc5f6aec963e_85221a3b_plan.html, 
> export_timeline_example_0b4ecc5f6aec963e_85221a3b_timeline.svg, 
> exported_plan.png, exported_timeline.png, plan_download.png, 
> plan_download_button.png, plan_export.png, plan_export_modal.png, 
> plan_export_text_selection.png, svg_wrapped_export.html, text_selection.png, 
> timeline_download-1.png, timeline_download.png, timeline_download_button.png, 
> timeline_export.png, timeline_export_modal.png, 
> timeline_export_text_selection-1.png, timeline_export_text_selection.png
>
>
> The graphical plan in the web UI is useful. It'd be nice to provide a button 
> to download the svg picture.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-12178) Refined alignment of timeticks in the webUI timeline

2024-05-27 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-12178:

Fix Version/s: Impala 4.3.0

> Refined alignment of timeticks in the webUI timeline
> 
>
> Key: IMPALA-12178
> URL: https://issues.apache.org/jira/browse/IMPALA-12178
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Surya Hebbar
>Assignee: Surya Hebbar
>Priority: Minor
> Fix For: Impala 4.3.0
>
> Attachments: overflowed_timetick_label.png, timetick_label_fixed.png
>
>
> The timeticks on the query timeline page in the WebUI were partially hidden 
> due to overflow of long timestamps after SVG rendering.
> It would be better if the entire timetick label were displayed appropriately. 
> !overflowed_timetick_label.png|width=808,height=259!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Resolved] (IMPALA-13102) Loading tables with illegal stats failed

2024-05-23 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang resolved IMPALA-13102.
-
Fix Version/s: Impala 4.5.0
   Resolution: Fixed

> Loading tables with illegal stats failed
> 
>
> Key: IMPALA-13102
> URL: https://issues.apache.org/jira/browse/IMPALA-13102
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
> Fix For: Impala 4.5.0
>
>
> When the table has illegal stats, e.g. numDVs=-100, Impala can't load the 
> table. So DROP STATS or DROP TABLE can't be performed on the table.
> {code:sql}
> [localhost:21050] default> drop stats alltypes_bak;
> Query: drop stats alltypes_bak
> ERROR: AnalysisException: Failed to load metadata for table: 'alltypes_bak'
> CAUSED BY: TableLoadingException: Failed to load metadata for table: 
> default.alltypes_bak
> CAUSED BY: IllegalStateException: ColumnStats{avgSize_=4.0, 
> avgSerializedSize_=4.0, maxSize_=4, numDistinct_=-100, numNulls_=0, 
> numTrues=-1, numFalses=-1, lowValue=-1, highValue=-1}{code}
> We should allow at least dropping the stats or dropping the table, so the 
> user can use Impala to recover the stats.
> Stacktrace in the logs:
> {noformat}
> I0520 08:00:56.661746 17543 jni-util.cc:321] 
> 5343142d1173494f:44dcde8c] 
> org.apache.impala.common.AnalysisException: Failed to load metadata for 
> table: 'alltypes_bak'
> at 
> org.apache.impala.analysis.Analyzer.resolveTableRef(Analyzer.java:974)
> at 
> org.apache.impala.analysis.DropStatsStmt.analyze(DropStatsStmt.java:94)
> at 
> org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:551)
> at 
> org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:498)
> at 
> org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2542)
> at 
> org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:2224)
> at 
> org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1985)
> at 
> org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:175)
> Caused by: org.apache.impala.catalog.TableLoadingException: Failed to load 
> metadata for table: default.alltypes_bak
> CAUSED BY: IllegalStateException: ColumnStats{avgSize_=4.0, 
> avgSerializedSize_=4.0, maxSize_=4, numDistinct_=-100, numNulls_=0, 
> numTrues=-1, numFalses=-1, lowValue=-1, highValue=-1}
> at 
> org.apache.impala.catalog.IncompleteTable.loadFromThrift(IncompleteTable.java:162)
> at org.apache.impala.catalog.Table.fromThrift(Table.java:586)
> at 
> org.apache.impala.catalog.ImpaladCatalog.addTable(ImpaladCatalog.java:479)
> at 
> org.apache.impala.catalog.ImpaladCatalog.addCatalogObject(ImpaladCatalog.java:334)
> at 
> org.apache.impala.catalog.ImpaladCatalog.updateCatalog(ImpaladCatalog.java:262)
> at 
> org.apache.impala.service.FeCatalogManager$CatalogdImpl.updateCatalogCache(FeCatalogManager.java:114)
> at 
> org.apache.impala.service.Frontend.updateCatalogCache(Frontend.java:585)
> at 
> org.apache.impala.service.JniFrontend.updateCatalogCache(JniFrontend.java:196)
> at .: 
> org.apache.impala.catalog.TableLoadingException: Failed to load metadata for 
> table: default.alltypes_bak
> at org.apache.impala.catalog.HdfsTable.load(HdfsTable.java:1318)
> at org.apache.impala.catalog.HdfsTable.load(HdfsTable.java:1213)
> at org.apache.impala.catalog.TableLoader.load(TableLoader.java:145)
> at 
> org.apache.impala.catalog.TableLoadingMgr$2.call(TableLoadingMgr.java:251)
> at 
> org.apache.impala.catalog.TableLoadingMgr$2.call(TableLoadingMgr.java:247)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:750)
> Caused by: java.lang.IllegalStateException: ColumnStats{avgSize_=4.0, 
> avgSerializedSize_=4.0, maxSize_=4, numDistinct_=-100, numNulls_=0, 
> numTrues=-1, numFalses=-1, lowValue=-1, highValue=-1}
> at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:512)
> at 
> org.apache.impala.catalog.ColumnStats.validate(ColumnStats.java:1034)
> at org.apache.impala.catalog.ColumnStats.update(ColumnStats.java:676)
> at org.apache.impala.catalog.Column.updateStats(Column.java:73)
> at 
> org.apache.impala.catalog.FeCatalogUtils.injectColumnStats(FeCatalogUtils.java:183)
> at 

[jira] [Commented] (IMPALA-12190) Renaming table will cause losing privileges for non-admin users

2024-05-22 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848508#comment-17848508
 ] 

Quanlong Huang commented on IMPALA-12190:
-

Column masking and row filtering policies will also be messed up by RENAME. I 
think tag-based policies will also be messed up if data lineages are not 
updated accordingly.

+1 for a new Ranger API that returns all policies matching a given table (and 
optionally for a given user). We also need this to improve IMPALA-11501 to 
avoid loading the table schema from HMS. Currently, to check whether a user has 
a corresponding column masking policy on a table, we have to load the table to 
get all the column names and check whether there are policies on each column, 
which is inefficient.
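To illustrate the inefficiency, a hedged sketch of the current per-column shape 
({{PolicyClient}} and its method are hypothetical, not Ranger's actual API):
{code:java}
import java.util.List;

// One policy lookup per column: the table schema must be loaded from HMS
// first just to enumerate column names, and the check is O(#columns).
// A Ranger API returning all policies for a table would avoid both costs.
interface PolicyClient {
  boolean hasColumnMaskPolicy(String user, String db, String table, String col);
}

class MaskingCheck {
  static boolean anyColumnMasked(PolicyClient client, String user, String db,
      String table, List<String> columnNames) {
    for (String col : columnNames) {
      if (client.hasColumnMaskPolicy(user, db, table, col)) return true;
    }
    return false;
  }
}
{code}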

> Renaming table will cause losing privileges for non-admin users
> ---
>
> Key: IMPALA-12190
> URL: https://issues.apache.org/jira/browse/IMPALA-12190
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Gabor Kaszab
>Assignee: Sai Hemanth Gantasala
>Priority: Critical
>  Labels: alter-table, authorization, ranger
>
> Let's say user 'a' gets some privileges on table 't'. When this table gets 
> renamed (even by user 'a') then user 'a' loses its privileges on that table.
>  
> Repro steps:
>  # Start impala with Ranger
>  # start impala-shell as admin (-u admin)
>  # create table tmp (i int, s string) stored as parquet;
>  # grant all on table tmp to user ;
>  # grant all on table tmp to user ;
> {code:java}
> Query: show grant user  on table tmp
> +++--+---++-+--+-+-+---+--+-+
> | principal_type | principal_name | database | table | column | uri | 
> storage_type | storage_uri | udf | privilege | grant_option | create_time |
> +++--+---++-+--+-+-+---+--+-+
> | USER           |     | default  | tmp   | *      |     |          
>     |             |     | all       | false        | NULL        |
> +++--+---++-+--+-+-+---+--+-+
> Fetched 1 row(s) in 0.01s {code}
>  #  alter table tmp rename to tmp_1234;
>  # show grant user  on table tmp_1234;
> {code:java}
> Query: show grant user  on table tmp_1234
> Fetched 0 row(s) in 0.17s{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-13074) WRITE TO HDFS node is omitted from Web UI graphic plan

2024-05-21 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13074:

Labels: ramp-up  (was: )

> WRITE TO HDFS node is omitted from Web UI graphic plan
> --
>
> Key: IMPALA-13074
> URL: https://issues.apache.org/jira/browse/IMPALA-13074
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Noemi Pap-Takacs
>Priority: Major
>  Labels: ramp-up
>
> The query plan shows the nodes that take part in the execution, forming a 
> tree structure.
> It can be displayed in the CLI by issuing the EXPLAIN  command. When 
> the actual query is executed, the plan tree can also be viewed in the Impala 
> Web UI in a graphic form.
> However, the explain string and the graphic plan tree do not match: the top 
> node is missing from the Web UI.
> This is especially confusing in the case of DDL and DML statements, where the 
> Data Sink is not displayed. This makes a SELECT * FROM table 
> indistinguishable from a CREATE TABLE, since both only display the SCAN node 
> and omit the WRITE_TO_HDFS and SELECT nodes.
> It would make sense to include the WRITE_TO_HDFS node in DML/DDL plans.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-13074) WRITE TO HDFS node is omitted from Web UI graphic plan

2024-05-21 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848422#comment-17848422
 ] 

Quanlong Huang commented on IMPALA-13074:
-

Names like "HDFS WRITER", "KUDU WRITER" will be consistent with the ExecSummary.

> WRITE TO HDFS node is omitted from Web UI graphic plan
> --
>
> Key: IMPALA-13074
> URL: https://issues.apache.org/jira/browse/IMPALA-13074
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Noemi Pap-Takacs
>Priority: Major
>
> The query plan shows the nodes that take part in the execution, forming a 
> tree structure.
> It can be displayed in the CLI by issuing the EXPLAIN  command. When 
> the actual query is executed, the plan tree can also be viewed in the Impala 
> Web UI in a graphic form.
> However, the explain string and the graphic plan tree do not match: the top 
> node is missing from the Web UI.
> This is especially confusing in the case of DDL and DML statements, where the 
> Data Sink is not displayed. This makes a SELECT * FROM table 
> indistinguishable from a CREATE TABLE, since both only display the SCAN node 
> and omit the WRITE_TO_HDFS and SELECT nodes.
> It would make sense to include the WRITE_TO_HDFS node in DML/DDL plans.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-13102) Loading tables with illegal stats failed

2024-05-21 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848395#comment-17848395
 ] 

Quanlong Huang commented on IMPALA-13102:
-

Uploaded a patch for review: https://gerrit.cloudera.org/c/21445/

> Loading tables with illegal stats failed
> 
>
> Key: IMPALA-13102
> URL: https://issues.apache.org/jira/browse/IMPALA-13102
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
>
> When the table has illegal stats, e.g. numDVs=-100, Impala can't load the 
> table. So DROP STATS or DROP TABLE can't be performed on the table.
> {code:sql}
> [localhost:21050] default> drop stats alltypes_bak;
> Query: drop stats alltypes_bak
> ERROR: AnalysisException: Failed to load metadata for table: 'alltypes_bak'
> CAUSED BY: TableLoadingException: Failed to load metadata for table: 
> default.alltypes_bak
> CAUSED BY: IllegalStateException: ColumnStats{avgSize_=4.0, 
> avgSerializedSize_=4.0, maxSize_=4, numDistinct_=-100, numNulls_=0, 
> numTrues=-1, numFalses=-1, lowValue=-1, highValue=-1}{code}
> We should allow at least dropping the stats or dropping the table, so the 
> user can use Impala to recover the stats.
> Stacktrace in the logs:
> {noformat}
> I0520 08:00:56.661746 17543 jni-util.cc:321] 
> 5343142d1173494f:44dcde8c] 
> org.apache.impala.common.AnalysisException: Failed to load metadata for 
> table: 'alltypes_bak'
> at 
> org.apache.impala.analysis.Analyzer.resolveTableRef(Analyzer.java:974)
> at 
> org.apache.impala.analysis.DropStatsStmt.analyze(DropStatsStmt.java:94)
> at 
> org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:551)
> at 
> org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:498)
> at 
> org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2542)
> at 
> org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:2224)
> at 
> org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1985)
> at 
> org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:175)
> Caused by: org.apache.impala.catalog.TableLoadingException: Failed to load 
> metadata for table: default.alltypes_bak
> CAUSED BY: IllegalStateException: ColumnStats{avgSize_=4.0, 
> avgSerializedSize_=4.0, maxSize_=4, numDistinct_=-100, numNulls_=0, 
> numTrues=-1, numFalses=-1, lowValue=-1, highValue=-1}
> at 
> org.apache.impala.catalog.IncompleteTable.loadFromThrift(IncompleteTable.java:162)
> at org.apache.impala.catalog.Table.fromThrift(Table.java:586)
> at 
> org.apache.impala.catalog.ImpaladCatalog.addTable(ImpaladCatalog.java:479)
> at 
> org.apache.impala.catalog.ImpaladCatalog.addCatalogObject(ImpaladCatalog.java:334)
> at 
> org.apache.impala.catalog.ImpaladCatalog.updateCatalog(ImpaladCatalog.java:262)
> at 
> org.apache.impala.service.FeCatalogManager$CatalogdImpl.updateCatalogCache(FeCatalogManager.java:114)
> at 
> org.apache.impala.service.Frontend.updateCatalogCache(Frontend.java:585)
> at 
> org.apache.impala.service.JniFrontend.updateCatalogCache(JniFrontend.java:196)
> at .: 
> org.apache.impala.catalog.TableLoadingException: Failed to load metadata for 
> table: default.alltypes_bak
> at org.apache.impala.catalog.HdfsTable.load(HdfsTable.java:1318)
> at org.apache.impala.catalog.HdfsTable.load(HdfsTable.java:1213)
> at org.apache.impala.catalog.TableLoader.load(TableLoader.java:145)
> at 
> org.apache.impala.catalog.TableLoadingMgr$2.call(TableLoadingMgr.java:251)
> at 
> org.apache.impala.catalog.TableLoadingMgr$2.call(TableLoadingMgr.java:247)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:750)
> Caused by: java.lang.IllegalStateException: ColumnStats{avgSize_=4.0, 
> avgSerializedSize_=4.0, maxSize_=4, numDistinct_=-100, numNulls_=0, 
> numTrues=-1, numFalses=-1, lowValue=-1, highValue=-1}
> at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:512)
> at 
> org.apache.impala.catalog.ColumnStats.validate(ColumnStats.java:1034)
> at org.apache.impala.catalog.ColumnStats.update(ColumnStats.java:676)
> at org.apache.impala.catalog.Column.updateStats(Column.java:73)
> at 
> org.apache.impala.catalog.FeCatalogUtils.injectColumnStats(FeCatalogUtils.java:183)
> at 

[jira] [Created] (IMPALA-13103) Corrupt column stats are not reported

2024-05-20 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13103:
---

 Summary: Corrupt column stats are not reported
 Key: IMPALA-13103
 URL: https://issues.apache.org/jira/browse/IMPALA-13103
 Project: IMPALA
  Issue Type: Bug
  Components: Frontend
Reporter: Quanlong Huang


Impala will report corrupt table stats in the query plan. However, corrupt 
column stats are not reported. For instance, consider the following table:
{code:sql}
create table t1 (id int, name string);
insert into t1 values (1, 'aaa'), (2, 'aaa'), (3, 'aaa'), (4, 'aaa');{code}
with the following stats:
{code:sql}
alter table t1 set tblproperties('numRows'='4');
alter table t1 set column stats name ('numNulls'='0');{code}
Note that column "id" has missing stats and column "name" has missing/corrupt 
stats (ndv=-1, numNulls=0).
Grouping by "id" will report the missing stats:
{code:sql}
explain select id, count(*) from t1 group by id;

WARNING: The following tables are missing relevant table and/or column 
statistics.
default.t1{code}
However, grouping by "name" doesn't report the missing/corrupt stats:
{noformat}
explain select name, count(*) from t1 group by name;
+---+
| Explain String
|
+---+
| Max Per-Host Resource Reservation: Memory=38.00MB Threads=2   
|
| Per-Host Resource Estimates: Memory=144MB 
|
| Codegen disabled by planner   
|
| Analyzed query: SELECT name, count(*) FROM `default`.t1 GROUP BY name 
|
|   
|
| F00:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1 
|
| |  Per-Host Resources: mem-estimate=144.00MB mem-reservation=38.00MB 
thread-reservation=2 |
| PLAN-ROOT SINK
|
| |  output exprs: name, count(*)   
|
| |  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB 
thread-reservation=0|
| | 
|
| 01:AGGREGATE [FINALIZE]   
|
| |  output: count(*)   
|
| |  group by: name 
|
| |  mem-estimate=128.00MB mem-reservation=34.00MB spill-buffer=2.00MB 
thread-reservation=0 |
| |  tuple-ids=1 row-size=20B cardinality=4 
|
| |  in pipelines: 01(GETNEXT), 00(OPEN)
|
| | 
|
| 00:SCAN HDFS [default.t1] 
|
|HDFS partitions=1/1 files=1 size=24B   
|
|stored statistics: 
|
|  table: rows=4 size=unavailable   
|
|  columns: all 
|
|extrapolated-rows=disabled max-scan-range-rows=4   
|
|mem-estimate=16.00MB mem-reservation=8.00KB thread-reservation=1   
|
|tuple-ids=0 row-size=12B cardinality=4 
|
|in pipelines: 00(GETNEXT)  
|
+---+
{noformat}
CC [~rizaon]
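
The corrupt values are already user-visible from the shell, so a plan warning 
would match what SHOW COLUMN STATS reports (a sketch; output abbreviated):
{code:sql}
show column stats t1;
-- abbreviated expectation: the 'name' row shows #Distinct Values = -1 together
-- with #Nulls = 0, the missing/corrupt combination the planner could warn about
{code}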







[jira] [Commented] (IMPALA-13102) Loading tables with illegal stats failed

2024-05-19 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847742#comment-17847742
 ] 

Quanlong Huang commented on IMPALA-13102:
-

In the Impala dev env, I can set the stats directly in PostgreSQL (the metastore database):
{code:sql}
psql -q -U hiveuser -d ${METASTORE_DB}

HMS_home_quanlong_workspace_Impala_cdp=> select "TBL_ID" from "TBLS" where 
"TBL_NAME" = 'alltypes_bak';
 TBL_ID 

 244931
(1 row)
HMS_home_quanlong_workspace_Impala_cdp=>  select "CS_ID", "DB_NAME", 
"TABLE_NAME", "COLUMN_NAME", "NUM_DISTINCTS" from "TAB_COL_STATS" where 
"TBL_ID" = 244931;
 CS_ID | DB_NAME |  TABLE_NAME  |   COLUMN_NAME   | NUM_DISTINCTS 
---+-+--+-+---
 68767 | default | alltypes_bak | double_col  |10
 68766 | default | alltypes_bak | id  |  7300
 68765 | default | alltypes_bak | tinyint_col |10
 68764 | default | alltypes_bak | timestamp_col   |  7300
 68763 | default | alltypes_bak | smallint_col|10
 68762 | default | alltypes_bak | date_string_col |   736
 68761 | default | alltypes_bak | string_col  |10
 68760 | default | alltypes_bak | float_col   |10
 68759 | default | alltypes_bak | bigint_col  |10
 68758 | default | alltypes_bak | year| 2
 68757 | default | alltypes_bak | bool_col|  
 68756 | default | alltypes_bak | int_col |10
(12 rows)
HMS_home_quanlong_workspace_Impala_cdp=> UPDATE "TAB_COL_STATS" SET 
"NUM_DISTINCTS" = -100 where "CS_ID" = 68766;
HMS_home_quanlong_workspace_Impala_cdp=> select "CS_ID", "DB_NAME", 
"TABLE_NAME", "COLUMN_NAME", "NUM_DISTINCTS" from "TAB_COL_STATS" where "CS_ID" 
= 68766;
 CS_ID | DB_NAME |  TABLE_NAME  | COLUMN_NAME | NUM_DISTINCTS 
---+-+--+-+---
 68766 | default | alltypes_bak | id  |  -100
(1 row)
{code}
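
After corrupting the stats this way, the failure can be reproduced from Impala 
once the table metadata is reloaded (a minimal sketch):
{code:sql}
invalidate metadata alltypes_bak;  -- force the table to be reloaded
drop stats alltypes_bak;           -- triggers the load, which fails as above
{code}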

> Loading tables with illegal stats failed
> 
>
> Key: IMPALA-13102
> URL: https://issues.apache.org/jira/browse/IMPALA-13102
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
>
> When the table has illegal stats, e.g. numDVs=-100, Impala can't load the 
> table. So DROP STATS or DROP TABLE can't be performed on the table.
> {code:sql}
> [localhost:21050] default> drop stats alltypes_bak;
> Query: drop stats alltypes_bak
> ERROR: AnalysisException: Failed to load metadata for table: 'alltypes_bak'
> CAUSED BY: TableLoadingException: Failed to load metadata for table: 
> default.alltypes_bak
> CAUSED BY: IllegalStateException: ColumnStats{avgSize_=4.0, 
> avgSerializedSize_=4.0, maxSize_=4, numDistinct_=-100, numNulls_=0, 
> numTrues=-1, numFalses=-1, lowValue=-1, highValue=-1}{code}
> We should allow at least dropping the stats or dropping the table, so the 
> user can use Impala to recover the stats.
> Stacktrace in the logs:
> {noformat}
> I0520 08:00:56.661746 17543 jni-util.cc:321] 
> 5343142d1173494f:44dcde8c] 
> org.apache.impala.common.AnalysisException: Failed to load metadata for 
> table: 'alltypes_bak'
> at 
> org.apache.impala.analysis.Analyzer.resolveTableRef(Analyzer.java:974)
> at 
> org.apache.impala.analysis.DropStatsStmt.analyze(DropStatsStmt.java:94)
> at 
> org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:551)
> at 
> org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:498)
> at 
> org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2542)
> at 
> org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:2224)
> at 
> org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1985)
> at 
> org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:175)
> Caused by: org.apache.impala.catalog.TableLoadingException: Failed to load 
> metadata for table: default.alltypes_bak
> CAUSED BY: IllegalStateException: ColumnStats{avgSize_=4.0, 
> avgSerializedSize_=4.0, maxSize_=4, numDistinct_=-100, numNulls_=0, 
> numTrues=-1, numFalses=-1, lowValue=-1, highValue=-1}
> at 
> org.apache.impala.catalog.IncompleteTable.loadFromThrift(IncompleteTable.java:162)
> at org.apache.impala.catalog.Table.fromThrift(Table.java:586)
> at 
> org.apache.impala.catalog.ImpaladCatalog.addTable(ImpaladCatalog.java:479)
> at 
> org.apache.impala.catalog.ImpaladCatalog.addCatalogObject(ImpaladCatalog.java:334)
> at 
> org.apache.impala.catalog.ImpaladCatalog.updateCatalog(ImpaladCatalog.java:262)
> at 
> 

[jira] [Updated] (IMPALA-13102) Loading tables with illegal stats failed

2024-05-19 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13102:

Description: 
When the table has illegal stats, e.g. numDVs=-100, Impala can't load the 
table. So DROP STATS or DROP TABLE can't be performed on the table.

{code:sql}
[localhost:21050] default> drop stats alltypes_bak;
Query: drop stats alltypes_bak
ERROR: AnalysisException: Failed to load metadata for table: 'alltypes_bak'
CAUSED BY: TableLoadingException: Failed to load metadata for table: 
default.alltypes_bak
CAUSED BY: IllegalStateException: ColumnStats{avgSize_=4.0, 
avgSerializedSize_=4.0, maxSize_=4, numDistinct_=-100, numNulls_=0, 
numTrues=-1, numFalses=-1, lowValue=-1, highValue=-1}{code}

We should allow at least dropping the stats or dropping the table, so the user 
can use Impala to recover the stats.

Stacktrace in the logs:
{noformat}
I0520 08:00:56.661746 17543 jni-util.cc:321] 5343142d1173494f:44dcde8c] 
org.apache.impala.common.AnalysisException: Failed to load metadata for table: 
'alltypes_bak'
at 
org.apache.impala.analysis.Analyzer.resolveTableRef(Analyzer.java:974)
at 
org.apache.impala.analysis.DropStatsStmt.analyze(DropStatsStmt.java:94)
at 
org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:551)
at 
org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:498)
at 
org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2542)
at 
org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:2224)
at 
org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1985)
at 
org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:175)
Caused by: org.apache.impala.catalog.TableLoadingException: Failed to load 
metadata for table: default.alltypes_bak
CAUSED BY: IllegalStateException: ColumnStats{avgSize_=4.0, 
avgSerializedSize_=4.0, maxSize_=4, numDistinct_=-100, numNulls_=0, 
numTrues=-1, numFalses=-1, lowValue=-1, highValue=-1}
at 
org.apache.impala.catalog.IncompleteTable.loadFromThrift(IncompleteTable.java:162)
at org.apache.impala.catalog.Table.fromThrift(Table.java:586)
at 
org.apache.impala.catalog.ImpaladCatalog.addTable(ImpaladCatalog.java:479)
at 
org.apache.impala.catalog.ImpaladCatalog.addCatalogObject(ImpaladCatalog.java:334)
at 
org.apache.impala.catalog.ImpaladCatalog.updateCatalog(ImpaladCatalog.java:262)
at 
org.apache.impala.service.FeCatalogManager$CatalogdImpl.updateCatalogCache(FeCatalogManager.java:114)
at 
org.apache.impala.service.Frontend.updateCatalogCache(Frontend.java:585)
at 
org.apache.impala.service.JniFrontend.updateCatalogCache(JniFrontend.java:196)
at .: 
org.apache.impala.catalog.TableLoadingException: Failed to load metadata for 
table: default.alltypes_bak
at org.apache.impala.catalog.HdfsTable.load(HdfsTable.java:1318)
at org.apache.impala.catalog.HdfsTable.load(HdfsTable.java:1213)
at org.apache.impala.catalog.TableLoader.load(TableLoader.java:145)
at 
org.apache.impala.catalog.TableLoadingMgr$2.call(TableLoadingMgr.java:251)
at 
org.apache.impala.catalog.TableLoadingMgr$2.call(TableLoadingMgr.java:247)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalStateException: ColumnStats{avgSize_=4.0, 
avgSerializedSize_=4.0, maxSize_=4, numDistinct_=-100, numNulls_=0, 
numTrues=-1, numFalses=-1, lowValue=-1, highValue=-1}
at 
com.google.common.base.Preconditions.checkState(Preconditions.java:512)
at org.apache.impala.catalog.ColumnStats.validate(ColumnStats.java:1034)
at org.apache.impala.catalog.ColumnStats.update(ColumnStats.java:676)
at org.apache.impala.catalog.Column.updateStats(Column.java:73)
at 
org.apache.impala.catalog.FeCatalogUtils.injectColumnStats(FeCatalogUtils.java:183)
at org.apache.impala.catalog.Table.loadAllColumnStats(Table.java:513)
at org.apache.impala.catalog.HdfsTable.load(HdfsTable.java:1269)
... 8 more{noformat}
CC [~VenuReddy] [~hemanth619] [~ngangam]

  was:
When the table has illegal stats, e.g. numDVs=-100, Impala can't load the 
table. So DROP STATS or DROP TABLE can't be performed on the table.

{code:sql}
[localhost:21050] default> drop stats alltypes_bak;
Query: drop stats alltypes_bak
ERROR: AnalysisException: Failed to load metadata for table: 'alltypes_bak'
CAUSED BY: TableLoadingException: Failed to load metadata for table: 
default.alltypes_bak
CAUSED BY: IllegalStateException: ColumnStats{avgSize_=4.0, 

[jira] [Created] (IMPALA-13102) Loading tables with illegal stats failed

2024-05-19 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13102:
---

 Summary: Loading tables with illegal stats failed
 Key: IMPALA-13102
 URL: https://issues.apache.org/jira/browse/IMPALA-13102
 Project: IMPALA
  Issue Type: Bug
  Components: Catalog
Reporter: Quanlong Huang
Assignee: Quanlong Huang


When the table has illegal stats, e.g. numDVs=-100, Impala can't load the 
table. So DROP STATS or DROP TABLE can't be performed on the table.

{code:sql}
[localhost:21050] default> drop stats alltypes_bak;
Query: drop stats alltypes_bak
ERROR: AnalysisException: Failed to load metadata for table: 'alltypes_bak'
CAUSED BY: TableLoadingException: Failed to load metadata for table: 
default.alltypes_bak
CAUSED BY: IllegalStateException: ColumnStats{avgSize_=4.0, 
avgSerializedSize_=4.0, maxSize_=4, numDistinct_=-100, numNulls_=0, 
numTrues=-1, numFalses=-1, lowValue=-1, highValue=-1}{code}

We should allow at least dropping the stats or dropping the table, so the user 
can use Impala to recover the stats.

Stacktrace in the logs:
{noformat}
I0520 08:00:56.661746 17543 jni-util.cc:321] 5343142d1173494f:44dcde8c] 
org.apache.impala.common.AnalysisException: Failed to load metadata for table: 
'alltypes_bak'
at 
org.apache.impala.analysis.Analyzer.resolveTableRef(Analyzer.java:974)
at 
org.apache.impala.analysis.DropStatsStmt.analyze(DropStatsStmt.java:94)
at 
org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:551)
at 
org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:498)
at 
org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2542)
at 
org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:2224)
at 
org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1985)
at 
org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:175)
Caused by: org.apache.impala.catalog.TableLoadingException: Failed to load 
metadata for table: default.alltypes_bak
CAUSED BY: IllegalStateException: ColumnStats{avgSize_=4.0, 
avgSerializedSize_=4.0, maxSize_=4, numDistinct_=-100, numNulls_=0, 
numTrues=-1, numFalses=-1, lowValue=-1, highValue=-1}
at 
org.apache.impala.catalog.IncompleteTable.loadFromThrift(IncompleteTable.java:162)
at org.apache.impala.catalog.Table.fromThrift(Table.java:586)
at 
org.apache.impala.catalog.ImpaladCatalog.addTable(ImpaladCatalog.java:479)
at 
org.apache.impala.catalog.ImpaladCatalog.addCatalogObject(ImpaladCatalog.java:334)
at 
org.apache.impala.catalog.ImpaladCatalog.updateCatalog(ImpaladCatalog.java:262)
at 
org.apache.impala.service.FeCatalogManager$CatalogdImpl.updateCatalogCache(FeCatalogManager.java:114)
at 
org.apache.impala.service.Frontend.updateCatalogCache(Frontend.java:585)
at 
org.apache.impala.service.JniFrontend.updateCatalogCache(JniFrontend.java:196)
at .: 
org.apache.impala.catalog.TableLoadingException: Failed to load metadata for 
table: default.alltypes_bak
at org.apache.impala.catalog.HdfsTable.load(HdfsTable.java:1318)
at org.apache.impala.catalog.HdfsTable.load(HdfsTable.java:1213)
at org.apache.impala.catalog.TableLoader.load(TableLoader.java:145)
at 
org.apache.impala.catalog.TableLoadingMgr$2.call(TableLoadingMgr.java:251)
at 
org.apache.impala.catalog.TableLoadingMgr$2.call(TableLoadingMgr.java:247)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalStateException: ColumnStats{avgSize_=4.0, 
avgSerializedSize_=4.0, maxSize_=4, numDistinct_=-100, numNulls_=0, 
numTrues=-1, numFalses=-1, lowValue=-1, highValue=-1}
at 
com.google.common.base.Preconditions.checkState(Preconditions.java:512)
at org.apache.impala.catalog.ColumnStats.validate(ColumnStats.java:1034)
at org.apache.impala.catalog.ColumnStats.update(ColumnStats.java:676)
at org.apache.impala.catalog.Column.updateStats(Column.java:73)
at 
org.apache.impala.catalog.FeCatalogUtils.injectColumnStats(FeCatalogUtils.java:183)
at org.apache.impala.catalog.Table.loadAllColumnStats(Table.java:513)
at org.apache.impala.catalog.HdfsTable.load(HdfsTable.java:1269)
... 8 more{noformat}







[jira] [Created] (IMPALA-13094) Query links in /admission page of admissiond doesn't work

2024-05-17 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13094:
---

 Summary: Query links in /admission page of admissiond doesn't work
 Key: IMPALA-13094
 URL: https://issues.apache.org/jira/browse/IMPALA-13094
 Project: IMPALA
  Issue Type: Bug
  Components: Backend
Reporter: Quanlong Huang
 Attachments: Selection_115.png, Selection_116.png

On the /admission page, there are records for queued queries and running 
queries. The details links for these queries use the hostname of the 
admissiond. Instead, they should point to the corresponding coordinators.

Clicking on such a link jumps to the /query_plan endpoint of the admissiond, 
which doesn't exist, so it fails with "Error: No URI handler for '/query_plan'".

Attached the screenshots for reference.

CC [~arawat] 







[jira] [Updated] (IMPALA-13094) Query links in /admission page of admissiond doesn't work

2024-05-17 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13094:

Attachment: Selection_116.png

> Query links in /admission page of admissiond doesn't work
> -
>
> Key: IMPALA-13094
> URL: https://issues.apache.org/jira/browse/IMPALA-13094
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Quanlong Huang
>Priority: Critical
> Attachments: Selection_115.png, Selection_116.png
>
>
> On the /admission page, there are records for queued queries and running 
> queries. The details links for these queries use the hostname of the 
> admissiond. Instead, they should point to the corresponding coordinators.
> Clicking on such a link jumps to the /query_plan endpoint of the admissiond, 
> which doesn't exist, so it fails with "Error: No URI handler for '/query_plan'".
> Attached the screenshots for reference.
> CC [~arawat] 






[jira] [Updated] (IMPALA-13094) Query links in /admission page of admissiond doesn't work

2024-05-17 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13094:

Attachment: Selection_115.png

> Query links in /admission page of admissiond doesn't work
> -
>
> Key: IMPALA-13094
> URL: https://issues.apache.org/jira/browse/IMPALA-13094
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Quanlong Huang
>Priority: Critical
> Attachments: Selection_115.png, Selection_116.png
>
>
> On the /admission page, there are records for queued queries and running 
> queries. The details links for these queries use the hostname of the 
> admissiond. Instead, they should point to the corresponding coordinators.
> Clicking on such a link jumps to the /query_plan endpoint of the admissiond, 
> which doesn't exist, so it fails with "Error: No URI handler for '/query_plan'".
> Attached the screenshots for reference.
> CC [~arawat] 






[jira] [Created] (IMPALA-13093) Insert into Huawei OBS table failed

2024-05-16 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13093:
---

 Summary: Insert into Huawei OBS table failed
 Key: IMPALA-13093
 URL: https://issues.apache.org/jira/browse/IMPALA-13093
 Project: IMPALA
  Issue Type: Bug
  Components: Backend
Affects Versions: Impala 4.3.0
Reporter: Quanlong Huang
Assignee: Quanlong Huang


Inserting into a table that uses Huawei OBS (Object Storage Service) as the 
storage will fail with the following error:
{noformat}
Query: insert into test_obs1 values (1, 'abc')

ERROR: Failed to get info on temporary HDFS file: 
obs://obs-test-ee93/input/test_obs1/_impala_insert_staging/fe4ac1be6462a13f_362a9b5b/.fe4ac1be6462a13f-362a9b5b_1213692075_dir//fe4ac1be6462a13f-362a9b5b_375832652_data.0.txt
Error(2): No such file or directory {noformat}
Looking into the logs:
{noformat}
I0516 16:40:55.663640 18922 status.cc:129] fe4ac1be6462a13f:362a9b5b] 
Failed to get info on temporary HDFS file: 
obs://obs-test-ee93/input/test_obs1/_impala_insert_staging/fe4ac1be6462a13f_362a9b5b/.fe4ac1be6462a13f-362a9b5b_1213692075_dir//fe4ac1be6462a13f-362a9b5b_375832652_data.0.txt
Error(2): No such file or directory
@   0xfc6d44  impala::Status::Status()
@  0x1c42020  impala::HdfsTableSink::CreateNewTmpFile()
@  0x1c44357  impala::HdfsTableSink::InitOutputPartition()
@  0x1c4988a  impala::HdfsTableSink::GetOutputPartition()
@  0x1c46569  impala::HdfsTableSink::Send()
@  0x14ee25f  impala::FragmentInstanceState::ExecInternal()
@  0x14efca3  impala::FragmentInstanceState::Exec()
@  0x148dc4c  impala::QueryState::ExecFInstance()
@  0x1b3bab9  impala::Thread::SuperviseThread()
@  0x1b3cdb1  boost::detail::thread_data<>::run()
@  0x2474a87  thread_proxy
@ 0x7fe5a562dea5  start_thread
@ 0x7fe5a25ddb0d  __clone{noformat}
Note that impalad is started with {{--symbolize_stacktrace=true}} so the 
stacktrace has symbols.
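
For reference, a minimal setup that hits this code path (a sketch; the column 
names are made up, the location is the one from the log above):
{code:sql}
create table test_obs1 (id int, s string)
location 'obs://obs-test-ee93/input/test_obs1';
insert into test_obs1 values (1, 'abc');  -- fails in HdfsTableSink::CreateNewTmpFile()
{code}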







[jira] [Updated] (IMPALA-13086) Cardinality estimate of AggregationNode should consider predicates on group-by columns

2024-05-15 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13086:

Attachment: plan.txt

> Cardinality estimate of AggregationNode should consider predicates on 
> group-by columns
> --
>
> Key: IMPALA-13086
> URL: https://issues.apache.org/jira/browse/IMPALA-13086
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Reporter: Quanlong Huang
>Priority: Critical
> Attachments: plan.txt
>
>
> Consider the following tables:
> {code:sql}
> CREATE EXTERNAL TABLE t1(
>   t1_id bigint,
>   t5_id bigint,
>   t5_name string,
>   register_date string
> ) stored as textfile;
> CREATE EXTERNAL TABLE t2(
>   t1_id bigint,
>   t3_id bigint,
>   pay_time timestamp,
>   refund_time timestamp,
>   state_code int
> ) stored as textfile;
> CREATE EXTERNAL TABLE t3(
>   t3_id bigint,
>   t3_name string,
>   class_id int
> ) stored as textfile;
> CREATE EXTERNAL TABLE t5( 
>   id bigint,
>   t5_id bigint,
>   t5_name string,
>   branch_id bigint,
>   branch_name string
> ) stored as textfile;
> alter table t1 set tblproperties('numRows'='6031170829');
> alter table t1 set column stats t1_id ('numDVs'='8131016','numNulls'='0');
> alter table t1 set column stats t5_id ('numDVs'='389','numNulls'='0');
> alter table t1 set column stats t5_name 
> ('numDVs'='523','numNulls'='85928157','maxsize'='27','avgSize'='17.79120063781738');
> alter table t1 set column stats register_date 
> ('numDVs'='9283','numNulls'='0','maxsize'='8','avgSize'='8');
> alter table t2 set tblproperties('numRows'='864341085');
> alter table t2 set column stats t1_id ('numDVs'='1007302','numNulls'='0');
> alter table t2 set column stats t3_id ('numDVs'='5013','numNulls'='2800503');
> alter table t2 set column stats pay_time ('numDVs'='1372020','numNulls'='0');
> alter table t2 set column stats refund_time 
> ('numDVs'='251658','numNulls'='791645118');
> alter table t2 set column stats state_code ('numDVs'='8','numNulls'='0');
> alter table t3 set tblproperties('numRows'='4452');
> alter table t3 set column stats t3_id ('numDVs'='4452','numNulls'='0');
> alter table t3 set column stats t3_name 
> ('numDVs'='4452','numNulls'='0','maxsize'='176','avgSize'='37.60469818115234');
> alter table t3 set column stats class_id ('numDVs'='75','numNulls'='0');
> alter table t5 set tblproperties('numRows'='2177245');
> alter table t5 set column stats t5_id ('numDVs'='826','numNulls'='0');
> alter table t5 set column stats t5_name 
> ('numDVs'='523','numNulls'='0','maxsize'='67','avgSize'='19.12560081481934');
> alter table t5 set column stats branch_id ('numDVs'='53','numNulls'='0');
> alter table t5 set column stats branch_name 
> ('numDVs'='55','numNulls'='0','maxsize'='61','avgSize'='16.05229949951172');
> {code}
> Put a data file in each table to make the stats valid:
> {code:bash}
> echo '2024' > data.txt
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t1
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t2
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t3
> hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t5
> {code}
> REFRESH these tables after adding the data files.
> The cardinalities of the AggregationNodes are overestimated in the following query:
> {code:sql}
> explain select 
>   register_date,
>   t4.t5_id, 
>   t5.t5_name,
>   t5.branch_name,
>   count(distinct t1_id),
>   count(distinct case when diff_day=0 then t1_id else null end ),
>   count(distinct case when diff_day<=3 then t1_id else null end ),
>   count(distinct case when diff_day<=7 then t1_id else null end ),
>   count(distinct case when diff_day<=14 then t1_id else null end ),
>   count(distinct case when diff_day<=30 then t1_id else null end ),
>   count(distinct case when diff_day<=60 then t1_id else null end ),
>   count(distinct case when pay_time is not null then t1_id else null end )
> from (
>   select t1.t1_id,t1.register_date,t1.t5_id,t2.pay_time,t2.t3_id,t3.t3_name,
> datediff(pay_time,register_date) diff_day
>   from (
> select t1_id,pay_time,t3_id from t2
> where state_code = 0 and pay_time>=trunc(NOW(),'Y')
>   and cast(pay_time as date) <> cast(refund_time as date)
>   )t2
>   join t3 on t2.t3_id=t3.t3_id
>   right join t1 on t1.t1_id=t2.t1_id
> )t4
> left join t5 on t4.t5_id=t5.t5_id
> where register_date='20230515'
> group by register_date,t4.t5_id,t5.t5_name,t5.branch_name;{code}
> One of the AggregationNode:
> {noformat}
> 17:AGGREGATE [FINALIZE]
> |  Class 0
> |output: count:merge(t1_id)
> |group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
> |  Class 1
> |output: count:merge(CASE WHEN diff_day = 0 THEN t1_id ELSE NULL END)
> |group 

[jira] [Created] (IMPALA-13086) Cardinality estimate of AggregationNode should consider predicates on group-by columns

2024-05-15 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-13086:
---

 Summary: Cardinality estimate of AggregationNode should consider 
predicates on group-by columns
 Key: IMPALA-13086
 URL: https://issues.apache.org/jira/browse/IMPALA-13086
 Project: IMPALA
  Issue Type: Bug
  Components: Frontend
Reporter: Quanlong Huang


Consider the following tables:
{code:sql}
CREATE EXTERNAL TABLE t1(
  t1_id bigint,
  t5_id bigint,
  t5_name string,
  register_date string
) stored as textfile;

CREATE EXTERNAL TABLE t2(
  t1_id bigint,
  t3_id bigint,
  pay_time timestamp,
  refund_time timestamp,
  state_code int
) stored as textfile;

CREATE EXTERNAL TABLE t3(
  t3_id bigint,
  t3_name string,
  class_id int
) stored as textfile;

CREATE EXTERNAL TABLE t5( 
  id bigint,
  t5_id bigint,
  t5_name string,
  branch_id bigint,
  branch_name string
) stored as textfile;

alter table t1 set tblproperties('numRows'='6031170829');
alter table t1 set column stats t1_id ('numDVs'='8131016','numNulls'='0');
alter table t1 set column stats t5_id ('numDVs'='389','numNulls'='0');
alter table t1 set column stats t5_name 
('numDVs'='523','numNulls'='85928157','maxsize'='27','avgSize'='17.79120063781738');
alter table t1 set column stats register_date 
('numDVs'='9283','numNulls'='0','maxsize'='8','avgSize'='8');

alter table t2 set tblproperties('numRows'='864341085');
alter table t2 set column stats t1_id ('numDVs'='1007302','numNulls'='0');
alter table t2 set column stats t3_id ('numDVs'='5013','numNulls'='2800503');
alter table t2 set column stats pay_time ('numDVs'='1372020','numNulls'='0');
alter table t2 set column stats refund_time 
('numDVs'='251658','numNulls'='791645118');
alter table t2 set column stats state_code ('numDVs'='8','numNulls'='0');

alter table t3 set tblproperties('numRows'='4452');
alter table t3 set column stats t3_id ('numDVs'='4452','numNulls'='0');
alter table t3 set column stats t3_name 
('numDVs'='4452','numNulls'='0','maxsize'='176','avgSize'='37.60469818115234');
alter table t3 set column stats class_id ('numDVs'='75','numNulls'='0');

alter table t5 set tblproperties('numRows'='2177245');
alter table t5 set column stats t5_id ('numDVs'='826','numNulls'='0');
alter table t5 set column stats t5_name 
('numDVs'='523','numNulls'='0','maxsize'='67','avgSize'='19.12560081481934');
alter table t5 set column stats branch_id ('numDVs'='53','numNulls'='0');
alter table t5 set column stats branch_name 
('numDVs'='55','numNulls'='0','maxsize'='61','avgSize'='16.05229949951172');
{code}
Put a data file in each table to make the stats valid:
{code:bash}
echo '2024' > data.txt
hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t1
hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t2
hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t3
hdfs dfs -put data.txt hdfs://localhost:20500/test-warehouse/lab2.db/t5
{code}
REFRESH these tables after adding the data files.

The cardinalities of the AggregationNodes are overestimated in the following query:
{code:sql}
explain select 
  register_date,
  t4.t5_id, 
  t5.t5_name,
  t5.branch_name,
  count(distinct t1_id),
  count(distinct case when diff_day=0 then t1_id else null end ),
  count(distinct case when diff_day<=3 then t1_id else null end ),
  count(distinct case when diff_day<=7 then t1_id else null end ),
  count(distinct case when diff_day<=14 then t1_id else null end ),
  count(distinct case when diff_day<=30 then t1_id else null end ),
  count(distinct case when diff_day<=60 then t1_id else null end ),
  count(distinct case when pay_time is not null then t1_id else null end )
from (
  select t1.t1_id,t1.register_date,t1.t5_id,t2.pay_time,t2.t3_id,t3.t3_name,
datediff(pay_time,register_date) diff_day
  from (
select t1_id,pay_time,t3_id from t2
where state_code = 0 and pay_time>=trunc(NOW(),'Y')
  and cast(pay_time as date) <> cast(refund_time as date)
  )t2
  join t3 on t2.t3_id=t3.t3_id
  right join t1 on t1.t1_id=t2.t1_id
)t4
left join t5 on t4.t5_id=t5.t5_id
where register_date='20230515'
group by register_date,t4.t5_id,t5.t5_name,t5.branch_name;{code}
One of the AggregationNode:
{noformat}
17:AGGREGATE [FINALIZE]
|  Class 0
|output: count:merge(t1_id)
|group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
|  Class 1
|output: count:merge(CASE WHEN diff_day = 0 THEN t1_id ELSE NULL END)
|group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
|  Class 2
|output: count:merge(CASE WHEN diff_day <= 3 THEN t1_id ELSE NULL END)
|group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
|  Class 3
|output: count:merge(CASE WHEN diff_day <= 7 THEN t1_id ELSE NULL END)
|group by: register_date, t4.t5_id, t5.t5_name, t5.branch_name
|  Class 4
|output: count:merge(CASE WHEN diff_day <= 14 THEN t1_id ELSE NULL END)
|group by: register_date, t4.t5_id, 
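
A back-of-the-envelope check, assuming the common model where the group-by 
cardinality is estimated as the product of the grouping columns' NDVs capped by 
the input cardinality:
{noformat}
naive estimate ~ NDV(register_date) * NDV(t5_id) * NDV(t5_name) * NDV(branch_name)
               ~ 9283 * 389 * 523 * 55 ~ 1.0e11  (then capped by the input rows)
with the predicate register_date = '20230515', the effective NDV of
register_date is 1, so the estimate should shrink by a factor of ~9283
{noformat}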

[jira] [Commented] (IMPALA-13077) Equality predicate on partition column and uncorrelated subquery doesn't reduce the cardinality estimate

2024-05-15 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846770#comment-17846770
 ] 

Quanlong Huang commented on IMPALA-13077:
-

It seems doable:
 * catalogd always loads the HMS partition objects, and 'numRows' is extracted 
from the parameters: 
[https://github.com/apache/impala/blob/f87c20800de9f7dc74e47aa9a8c0dc878f4f0840/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L1415]
 * the coordinator always loads all partitions when planning such queries.

Pulling partition-level column stats like NDVs would help more, since they are 
more accurate than the table-level column stats. But using the partition-level 
'numRows' already helps a lot in this case.
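
The per-partition row counts the planner would need are already user-visible 
(a sketch):
{code:sql}
show partitions tpcds.store_sales;
-- the #Rows column carries the per-partition 'numRows'; max(#Rows) across
-- partitions is the bound the JoinNode estimate should respect
{code}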

> Equality predicate on partition column and uncorrelated subquery doesn't 
> reduce the cardinality estimate
> 
>
> Key: IMPALA-13077
> URL: https://issues.apache.org/jira/browse/IMPALA-13077
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
>
> Let's say 'part_tbl' is a partitioned table. Its partition key is 'part_key'. 
> Consider the following query:
> {code:sql}
> select xxx from part_tbl
> where part_key=(select ... from dim_tbl);
> {code}
> Its query plan is a JoinNode with two ScanNodes. When estimating the 
> cardinality of the JoinNode, the planner is not aware that 'part_key' is the 
> partition column and that the cardinality of the JoinNode should not be larger 
> than the max row count across partitions.
> The recent work in IMPALA-12018 (Consider runtime filter for cardinality 
> reduction) helps in some cases since there are runtime filters on the 
> partition column. But there are still some cases that we overestimate the 
> cardinality. For instance, 'ss_sold_date_sk' is the only partition key of 
> tpcds.store_sales. The following query
> {code:sql}
> select count(*) from tpcds.store_sales
> where ss_sold_date_sk=(
>   select min(d_date_sk) + 1000 from tpcds.date_dim);{code}
> has query plan:
> {noformat}
> +-+
> | Explain String  |
> +-+
> | Max Per-Host Resource Reservation: Memory=18.94MB Threads=6 |
> | Per-Host Resource Estimates: Memory=243MB   |
> | |
> | PLAN-ROOT SINK  |
> | |   |
> | 09:AGGREGATE [FINALIZE] |
> | |  output: count:merge(*)   |
> | |  row-size=8B cardinality=1|
> | |   |
> | 08:EXCHANGE [UNPARTITIONED] |
> | |   |
> | 04:AGGREGATE|
> | |  output: count(*) |
> | |  row-size=8B cardinality=1|
> | |   |
> | 03:HASH JOIN [LEFT SEMI JOIN, BROADCAST]|
> | |  hash predicates: ss_sold_date_sk = min(d_date_sk) + 1000 |
> | |  runtime filters: RF000 <- min(d_date_sk) + 1000  |
> | |  row-size=4B cardinality=2.88M < Should be max(numRows) across 
> partitions
> | |   |
> | |--07:EXCHANGE [BROADCAST]  |
> | |  ||
> | |  06:AGGREGATE [FINALIZE]  |
> | |  |  output: min:merge(d_date_sk)  |
> | |  |  row-size=4B cardinality=1 |
> | |  ||
> | |  05:EXCHANGE [UNPARTITIONED]  |
> | |  ||
> | |  02:AGGREGATE |
> | |  |  output: min(d_date_sk)|
> | |  |  row-size=4B cardinality=1 |
> | |  ||
> | |  01:SCAN HDFS [tpcds.date_dim]|
> | | HDFS partitions=1/1 files=1 size=9.84MB   |
> | | row-size=4B cardinality=73.05K|
> | |   |
> | 00:SCAN HDFS [tpcds.store_sales]|
> |HDFS 
