In my env, first time took about 1.6s, second time only took 0.5s 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10; +------+---------------------------+ | cnt | actor_preferred_username | +------+---------------------------+ | 1 | _20_xxxx | +------+---------------------------+ 1 row selected (1.585 seconds) 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10; +------+---------------------------+ | cnt | actor_preferred_username | +------+---------------------------+ | 1 | _20_xxxx | +------+---------------------------+ 1 row selected (0.505 seconds)
I am running 4 vm servers, heap 4GB and direct 8GB. But this query only using one fragment , see attached drill profile. As what Dechang suggested you can check profile and see the time mainly spent on which step. Kathleen On 12/15/16, 2:27 PM, "Dechang Gu" <d...@maprtech.com> wrote: >Yousuf, >Which version of drill are you running? >Can you share the profile of the query? > >Thanks, >Dechang > > > >On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yousufud...@css.org.sa> wrote: > >> Hello experts >> >> As a POC project, I've built a drill cluster on 5 VMs , each with the >> following specs >> >> 32 GB ram >> >> 1 TB storage >> >> 16 Cores >> >> Zookeeper quorum & apache drill installed on all 5 nodes. My storage >> engine is mongo which has 5 million docs. (Our daily collection is close to >> 2.5 million tweets) >> >> *Problem* >> >> My aggregation queries are slow, not fit for my realtime dashboard. sample >> query as follows took 12.+ secs, >> >> >> SELECT count(*) as cnt, actor_preferred_username from tweets where >> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time >> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by >> cnt desc limit 10; >> >> Pls. note that, >> >> *In Drill* >> >> 1) I've changed the conf/drill-env.sh on each node with following >> >> |export DRILL_HEAP=${DRILL_HEAP:-"8G”} export >> DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}| >> >> 2) changed few setting based on the docs <https://drill.apache.org/docs >> /sort-based-and-hash-based-memory-constrained-operators/> >> >> *In Mongo* >> >> data is sharded on 3 servers on shardkey tweet_id which distributes data >> evenly on all shards and created compound index on fields used by the above >> drill query. >> >> >> sample data >> >> { >> "_id" : ObjectId("58524d507e08dae4c0377b9e"), >> "rule_list" : [ >> "A", >> "B", >> "C", >> "D13" >> ], >> "actor_friends_count" : 40, >> "klout_score" : 28, >> "actor_favorites_count" : 1697, >> "actor_preferred_username" : "_20_xxxx", >> "sentiment" : "neu", >> "tweet_id" : "tag:search.twitter.com,2005:000xxxxx", >> "object_actor_followers_count" : 573, >> "actor_posted_time" : "2016-06-24T00:37:54.000Z", >> "actor_id" : "id:twitter.com:xxxxxxxxxx", >> "actor_display_name" : "xxxxxx", >> "retweet_count" : 57, >> "hashtag_list" : [ >> "c", >> "d", >> "E", >> "f", >> "VCVC", >> "XXCXCXC", >> "RRRR" >> ], >> "body" : "some tweet blah blah", >> "actor_followers_count" : 21, >> "actor_status_count" : 1756, >> "verb" : "share", >> "posted_time" : "2016-08-01T23:47:43.000Z", >> "object_actor_status_count" : 2796, >> "lang" : "ar", >> "object_actor_preferred_username" : "xxxxxxx", >> "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx", >> "gender" : "male", >> "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx", >> "favorites_count" : 0, >> "object_posted_time" : "2016-08-01T22:54:22.000Z", >> "object_actor_friends_count" : 69, >> "generator_display_name" : "Twitter Web Client", >> "object_actor_display_name" : "xxxxxxxxxxxx", >> "actor_listed_count" : 0 >> } >> >> >> >> *Questions* >> >> 1) How to improve aggregation query performance? >> >> 2) Do I also need to do something in mongodb to enhance performance? (I >> mean optimize source) >> >> 3) Does Apache drill capable of handling/aggregating billions of documents >> in real-time? >> >> Your early response is highly appreciated! >> >> Thank you & Kind Regards >> >> Yousuf >> >> >> >>