Simple query on 150 billion records

François Méthot Mon, 04 Apr 2016 07:19:17 -0700

Hi,

  We are querying 150 billion records spread over ~21,000 Parquet files
stored in HDFS on 13 nodes (6 cores each, max direct memory 32 GB, max heap
8 GB).


Is there a known issue or Drill limitation that would explain why the first
query below can't return the expected single row and aggregation?

create table ANALYSIS_RESULT as (
select to_date(to_timestamp(SECONDS)), count(1)
from hdfs.`/data/`
where Int32Field2=123456 or Int32Field2=4567898
group by to_date(to_timestamp(SECONDS)));

After *20 hours*, SYSTEM ERROR: Foreman Exception: One more more nodes lost
connectivity during query.


If we do the query in 2 steps:
create table ANALYSIS_RESULT as (
select Int32Field1 as SECONDS from hdfs.`/data/` where Int32Field2=123456
or Int32Field2=4567898);

The result was returned in *43 minutes* (a single record).

select to_date(to_timestamp(SECONDS)), count(1)
from ANALYSIS_RESULT
group by to_date(to_timestamp(SECONDS));

Aggregation of that single record is of course done in < 1 second:
   2016-04-04          1



I also tried:
select to_date(to_timestamp(SECONDS)), count(1) from (
select Int32Field1 as SECONDS
from hdfs.`/data/`
where Int32Field2=123456 or Int32Field2=4567898)
group by to_date(to_timestamp(SECONDS));

Same thing: After *21 hours*, SYSTEM ERROR: Foreman Exception: One more
more nodes lost connectivity during query.


Thanks for your help
Francois
