Hi All, I'm using Spark/Shark as the foundation for some reporting that I'm doing and have a customers table with approximately 3 million rows that I've cached in memory.
I've also created a partitioned table that I've also cached in memory on a per day basis FROM customers_cached INSERT OVERWRITE TABLE part_customers_cached PARTITION(createday) SELECT id,email,dt_cr, to_date(dt_cr) as createday where dt_cr>unix_timestamp('2013-01-01 00:00:00') and dt_cr<unix_timestamp('2013-12-31 23:59:59'); set exec.dynamic.partition=true; set exec.dynamic.partition.mode=nonstrict; however when I run the following basic tests I get this type of performance [localhost:10000] shark> select count(*) from part_customers_cached where createday >= '2014-08-01' and createday <= '2014-12-06'; 37204 Time taken (including network latency): 3.131 seconds [localhost:10000] shark> SELECT count(*) from customers_cached where dt_cr>unix_timestamp('2013-08-01 00:00:00') and dt_cr<unix_timestamp('2013-12-06 23:59:59'); 37204 Time taken (including network latency): 1.538 seconds I'm running this on a cluster with one master and two slaves and was hoping that the partitioned table would be noticeably faster but it looks as though the partitioning has slowed things down... Is this the case, or is there some additional configuration that I need to do to speed things up? Best Wishes, Gordon