analyze table columns none refresh metadata performance

James Turton Sun, 31 May 2020 01:48:04 -0700

Hi

I have a directory of 387 Parquet files that amount to a single data set
of 131 GB.  Querying them with Drill works nicely.  When I try to collect
metadata for this table with


    analyze table columns none refresh metadata

that command uses a mind-boggling amount of CPU time: at least on the
order of 10 CPU-hours, and probably on the order of 100 CPU-hours [1].  It
cannot require that much CPU time to collect metadata from a few hundred
Parquet files.  Surely?  I'd /like/ to collect statistics for some
columns too, but I've had to forgo that so far because of how slow this
command is.
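
To put those figures in perspective, here is a rough back-of-the-envelope
sketch in Python.  It uses only the numbers quoted in this message (387
files, 10 vCPUs, 10 to 100 CPU-hours) and assumes, purely for illustration,
that the work is spread evenly across the files and that all vCPUs stay
fully busy; neither assumption has been measured.  The function names are
made up for this sketch.

```python
# Figures quoted in the message; the even split across files and the
# fully busy vCPUs are simplifying assumptions, not measurements.
NUM_FILES = 387
NUM_VCPUS = 10


def per_file_cpu_seconds(cpu_hours, num_files=NUM_FILES):
    """Average CPU seconds per file if the work were spread evenly."""
    return cpu_hours * 3600 / num_files


def min_wall_clock_hours(cpu_hours, num_vcpus=NUM_VCPUS):
    """Best-case wall-clock hours if every vCPU stayed fully busy."""
    return cpu_hours / num_vcpus


for cpu_hours in (10, 100):
    print(
        f"{cpu_hours} CPU-hours: "
        f"{per_file_cpu_seconds(cpu_hours):.0f} CPU-seconds per file, "
        f"at least {min_wall_clock_hours(cpu_hours):.0f} h wall-clock"
    )
```

Under those assumptions the estimate works out to roughly 93 to 930
CPU-seconds per Parquet file, and between 1 and 10 hours of wall-clock time
on this guest.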

[1] This is on a VMware guest with 10 vCPUs that are reported as Intel
Xeon CPU E5-2690 v4 @ 2.60GHz.
