Hello experts

As a POC project, I've built a drill cluster on 5 VMs , each with the following specs

32 GB ram

1 TB storage

16 Cores

Zookeeper quorum & apache drill installed on all 5 nodes. My storage engine is mongo which has 5 million docs. (Our daily collection is close to 2.5 million tweets)

*Problem*

My aggregation queries are slow, not fit for my realtime dashboard. sample query as follows took 12.+ secs,


SELECT count(*) as cnt, actor_preferred_username from tweets where posted_time >= '2016-08-01T00.00.00.000Z' and posted_time <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by cnt desc limit 10;

Pls. note that,

*In Drill*

1) I've changed the conf/drill-env.sh on each node with following

|export DRILL_HEAP=${DRILL_HEAP:-"8G”} export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}|

2) changed few setting based on the docs <https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/>

*In Mongo*

data is sharded on 3 servers on shardkey tweet_id which distributes data evenly on all shards and created compound index on fields used by the above drill query.


sample data

{
    "_id" : ObjectId("58524d507e08dae4c0377b9e"),
    "rule_list" : [
        "A",
        "B",
        "C",
        "D13"
    ],
    "actor_friends_count" : 40,
    "klout_score" : 28,
    "actor_favorites_count" : 1697,
    "actor_preferred_username" : "_20_xxxx",
    "sentiment" : "neu",
    "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
    "object_actor_followers_count" : 573,
    "actor_posted_time" : "2016-06-24T00:37:54.000Z",
    "actor_id" : "id:twitter.com:xxxxxxxxxx",
    "actor_display_name" : "xxxxxx",
    "retweet_count" : 57,
    "hashtag_list" : [
        "c",
        "d",
        "E",
        "f",
        "VCVC",
        "XXCXCXC",
        "RRRR"
    ],
    "body" : "some tweet blah blah",
    "actor_followers_count" : 21,
    "actor_status_count" : 1756,
    "verb" : "share",
    "posted_time" : "2016-08-01T23:47:43.000Z",
    "object_actor_status_count" : 2796,
    "lang" : "ar",
    "object_actor_preferred_username" : "xxxxxxx",
    "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
    "gender" : "male",
    "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
    "favorites_count" : 0,
    "object_posted_time" : "2016-08-01T22:54:22.000Z",
    "object_actor_friends_count" : 69,
    "generator_display_name" : "Twitter Web Client",
    "object_actor_display_name" : "xxxxxxxxxxxx",
    "actor_listed_count" : 0
}



*Questions*

1) How to improve aggregation query performance?

2) Do I also need to do something in mongodb to enhance performance? (I mean optimize source)

3) Does Apache drill capable of handling/aggregating billions of documents in real-time?

Your early response is highly appreciated!

Thank you & Kind Regards

Yousuf



Reply via email to