Re: Question about Utilities#getInputPaths

Gopal Vijayaraghavan Wed, 05 Oct 2016 11:37:09 -0700

(always helpful to call out a version, I'm going to assume 1.2)

>    select * from (select count(1) from T union all select count(1) from T2) x;


>    I have to admit that I don't quite understand that. Would it mean that we'd
>   only get a single row if we left out this empty path?

AFAIK, this is a bit of historical stuff from MR, where a 0 task job is not 
valid (in Tez, it is).

I know of at least one fix to the metadata optimizations for partitioned
tables, which does this faster (but is not in Apache, AFAIK):

https://issues.apache.org/jira/browse/HIVE-10596

>    I do not understand the internals of query planning and execution well
>   enough but if someone has time to explain it to me I'd be very grateful.

There are DEBUG-level log lines tagged <PERFLOG> that would be useful in
debugging why this is slow.

--hiveconf hive.root.logger=DEBUG,DRFA should get you the split of times
within a query.
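
Once the DEBUG log is there, something like the following could total up the
time per phase. This is only a rough sketch: it assumes the closing lines look
like "</PERFLOG method=... start=... end=... duration=... from=...>", which is
what I'd expect from 1.2 but is worth checking against your own hive.log. The
function name and the sample lines are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed shape of a closing PerfLogger line; verify against your own log.
PERFLOG_END = re.compile(r"</PERFLOG method=(?P<method>\S+) .*?duration=(?P<ms>\d+)")


def perflog_durations(lines):
    """Sum the reported duration (in ms) per PERFLOG method name."""
    totals = defaultdict(int)
    for line in lines:
        match = PERFLOG_END.search(line)
        if match:
            totals[match.group("method")] += int(match.group("ms"))
    return dict(totals)


# Made-up sample lines, not taken from a real run.
sample = [
    "<PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver>",
    "</PERFLOG method=compile start=1000 end=1250 duration=250 from=org.apache.hadoop.hive.ql.Driver>",
    "</PERFLOG method=getSplits start=1300 end=9300 duration=8000 from=example.Caller>",
    "</PERFLOG method=compile start=9400 end=9450 duration=50 from=org.apache.hadoop.hive.ql.Driver>",
]

# Print the slowest phases first.
for method, ms in sorted(perflog_durations(sample).items(), key=lambda kv: -kv[1]):
    print(f"{method}: {ms} ms")
```

To run it on a real log, pass an open file handle instead of the sample list.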

> For simple queries like SELECT * FROM T LIMIT 10 I'm seeing 5-10min runtimes
> just because of this overhead.

There are 2 optimizers you can disable and try this out.

set hive.optimize.metadataonly=false;
set hive.fetch.task.conversion=minimal; (or none)

The first one prevents creation of dummy files for a simple query like the 
count(1).

The second one prevents an optimizer check that sums up the sizes of all the
input files until the total reaches 1 GB, at which point it disables the fetch
codepath.
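
To illustrate why that check can be expensive, here is a minimal sketch of a
"sum sizes until a threshold" loop. It is not Hive's actual code: the function
name, the size_of callback and the return shape are all hypothetical. The point
is only that a table made of many small files never hits the limit early, so
every single file ends up being looked at.

```python
import os

ONE_GB = 1024 ** 3  # assumed to correspond to the 1 GB limit described above


def scan_until_threshold(paths, threshold=ONE_GB, size_of=os.path.getsize):
    """Add up file sizes, stopping once the running total exceeds threshold.

    Returns (fits, files_checked): fits is True when the total stayed within
    the threshold, files_checked is how many size lookups were needed.
    """
    total = 0
    checked = 0
    for path in paths:
        total += size_of(path)  # one metadata lookup per file
        checked += 1
        if total > threshold:
            return False, checked  # early exit: too big for the fetch path
    return True, checked


# Hypothetical sizes standing in for real files.
sizes = {"big-1": 600, "big-2": 500, "small": 10}
print(scan_until_threshold(["big-1", "big-2", "small"], 1000, sizes.__getitem__))
print(scan_until_threshold(["small"] * 5, 1000, sizes.__getitem__))
```

With a couple of large files the loop exits after a few lookups; with lots of
tiny files it has to walk the whole list before it can answer.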

Cheers,
Gopal
