Hello experts
As a POC project, I've built a drill cluster on 5 VMs , each with the
following specs
32 GB ram
1 TB storage
16 Cores
Zookeeper quorum & apache drill installed on all 5 nodes. My storage
engine is mongo which has 5 million docs. (Our daily collection is close
to 2.5 million tweets)
*Problem*
My aggregation queries are slow, not fit for my realtime dashboard.
sample query as follows took 12.+ secs,
SELECT count(*) as cnt, actor_preferred_username from tweets where
posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
<='2016-08-10T00.00.00.000Z' group by actor_preferred_username order
by cnt desc limit 10;
Pls. note that,
*In Drill*
1) I've changed the conf/drill-env.sh on each node with following
|export DRILL_HEAP=${DRILL_HEAP:-"8G”} export
DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}|
2) changed few setting based on the docs
<https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/>
*In Mongo*
data is sharded on 3 servers on shardkey tweet_id which distributes data
evenly on all shards and created compound index on fields used by the
above drill query.
sample data
{
"_id" : ObjectId("58524d507e08dae4c0377b9e"),
"rule_list" : [
"A",
"B",
"C",
"D13"
],
"actor_friends_count" : 40,
"klout_score" : 28,
"actor_favorites_count" : 1697,
"actor_preferred_username" : "_20_xxxx",
"sentiment" : "neu",
"tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
"object_actor_followers_count" : 573,
"actor_posted_time" : "2016-06-24T00:37:54.000Z",
"actor_id" : "id:twitter.com:xxxxxxxxxx",
"actor_display_name" : "xxxxxx",
"retweet_count" : 57,
"hashtag_list" : [
"c",
"d",
"E",
"f",
"VCVC",
"XXCXCXC",
"RRRR"
],
"body" : "some tweet blah blah",
"actor_followers_count" : 21,
"actor_status_count" : 1756,
"verb" : "share",
"posted_time" : "2016-08-01T23:47:43.000Z",
"object_actor_status_count" : 2796,
"lang" : "ar",
"object_actor_preferred_username" : "xxxxxxx",
"original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
"gender" : "male",
"object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
"favorites_count" : 0,
"object_posted_time" : "2016-08-01T22:54:22.000Z",
"object_actor_friends_count" : 69,
"generator_display_name" : "Twitter Web Client",
"object_actor_display_name" : "xxxxxxxxxxxx",
"actor_listed_count" : 0
}
*Questions*
1) How to improve aggregation query performance?
2) Do I also need to do something in mongodb to enhance performance? (I
mean optimize source)
3) Does Apache drill capable of handling/aggregating billions of
documents in real-time?
Your early response is highly appreciated!
Thank you & Kind Regards
Yousuf