Hi Kathleen,
Thanks for responding...
I've noticed that when I set alter session set store.mongo.bson.record.reader =
true; performance improves. However, other queries now fail :(.
0: jdbc:drill:> alter session set store.mongo.bson.record.reader = true;
+-------+------------------------------------------+
| ok | summary |
+-------+------------------------------------------+
| true | store.mongo.bson.record.reader updated. |
+-------+------------------------------------------+
1 row selected (0.082 seconds)
0: jdbc:drill:> SELECT count(*) as cnt, actor_preferred_username from
tweets group by actor_preferred_username order by cnt desc limit 10;
+--------+---------------------------+
| cnt | actor_preferred_username |
+--------+---------------------------+
| 10770 | mrnota53 |
| 6239 | cyberahsokatano |
| 4609 | abeerlilak |
| 4562 | DaeshAjel3 |
| 4523 | Aster__Q |
| 4275 | DaeshCrimes |
| 4182 | AfwfwefOfwefho |
| 3980 | HewarMaftuh2 |
| 3698 | DaeshAjel |
| 3661 | sarosh_iq |
+--------+---------------------------+
10 rows selected (4.64 seconds)
Failed query when store.mongo.bson.record.reader = true:
0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select
flatten(hashtag_list) as hashtag from hashtags) group by hashtag order
by cnt desc limit 10;
Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a
VarChar type when you are using a ValueWriter of type UnionListWriter.
Fragment 4:1
[Error Id: 278752e2-a959-482c-b4b0-b79ba923f148 on test01.css.org:31010]
(java.lang.IllegalArgumentException) You tried to write a VarChar
type when you are using a ValueWriter of type UnionListWriter.
org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.fail():762
org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.write():325
org.apache.drill.exec.vector.complex.impl.UnionListWriter.write():91
org.apache.drill.exec.store.bson.BsonRecordReader.writeString():275
org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
org.apache.drill.exec.store.bson.BsonRecordReader.write():75
org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
org.apache.drill.exec.physical.impl.ScanBatch.next():178
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext():120
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.buildSchema():97
org.apache.drill.exec.record.AbstractRecordBatch.next():142
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104
org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():745 (state=,code=0)
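Since the trace shows the failure inside the BSON reader's list handling (BsonRecordReader.writeToListOrMap), one possible workaround while this is investigated is to toggle the option off in the same session just for queries that flatten array fields. This is only a sketch, assuming the default reader still handles these documents as it did before the option was enabled:

```sql
-- Sketch of a per-session workaround: disable the BSON reader only for
-- queries that flatten array fields, since those hit the UnionListWriter error.
ALTER SESSION SET `store.mongo.bson.record.reader` = false;

SELECT hashtag, count(*) AS cnt
FROM (SELECT flatten(hashtag_list) AS hashtag FROM hashtags)
GROUP BY hashtag
ORDER BY cnt DESC
LIMIT 10;

-- Re-enable it afterwards for the aggregation queries that benefit from it.
ALTER SESSION SET `store.mongo.bson.record.reader` = true;
```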
On 12/16/2016 01:55 AM, Kathleen Li wrote:
In my environment, the first run took about 1.6s; the second took only 0.5s:
0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt,
actor_preferred_username from test where
. . . . . . . . . . . . . . . . . . . . . . .> posted_time >=
'2016-08-01T00.00.00.000Z' and posted_time
. . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z'
group by actor_preferred_username order by
. . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
+------+---------------------------+
| cnt | actor_preferred_username |
+------+---------------------------+
| 1 | _20_xxxx |
+------+---------------------------+
1 row selected (1.585 seconds)
0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt,
actor_preferred_username from test where
. . . . . . . . . . . . . . . . . . . . . . .> posted_time >=
'2016-08-01T00.00.00.000Z' and posted_time
. . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z'
group by actor_preferred_username order by
. . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
+------+---------------------------+
| cnt | actor_preferred_username |
+------+---------------------------+
| 1 | _20_xxxx |
+------+---------------------------+
1 row selected (0.505 seconds)
I am running 4 VM servers with 4GB heap and 8GB direct memory, but this
query uses only one fragment; see the attached Drill profile.
As Dechang suggested, you can check the profile to see which step most
of the time is spent on.
Kathleen
On 12/15/16, 2:27 PM, "Dechang Gu" <d...@maprtech.com> wrote:
Yousuf,
Which version of Drill are you running?
Can you share the profile of the query?
Thanks,
Dechang
On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yousufud...@css.org.sa> wrote:
Hello experts
As a POC project, I've built a Drill cluster on 5 VMs, each with the
following specs:
32 GB ram
1 TB storage
16 Cores
A ZooKeeper quorum and Apache Drill are installed on all 5 nodes. My
storage engine is MongoDB, which holds 5 million docs. (Our daily
collection is close to 2.5 million tweets.)
*Problem*
My aggregation queries are too slow for my real-time dashboard. The
following sample query took 12+ seconds:
SELECT count(*) as cnt, actor_preferred_username from tweets where
posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
<='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
cnt desc limit 10;
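Before tuning memory or MongoDB, it may help to see where the time actually goes. A sketch using Drill's EXPLAIN, which shows whether the WHERE clause is pushed down into the Mongo scan or applied by Drill after reading every document:

```sql
-- Inspect the physical plan: if the posted_time filter does not appear
-- inside the Mongo scan operator, Drill is reading all documents and
-- filtering them itself, which would explain the 12+ second runtime.
EXPLAIN PLAN FOR
SELECT count(*) AS cnt, actor_preferred_username
FROM tweets
WHERE posted_time >= '2016-08-01T00.00.00.000Z'
  AND posted_time <= '2016-08-10T00.00.00.000Z'
GROUP BY actor_preferred_username
ORDER BY cnt DESC
LIMIT 10;
```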
Please note that:
*In Drill*
1) I've changed conf/drill-env.sh on each node with the following:
export DRILL_HEAP=${DRILL_HEAP:-"8G"}
export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}
2) Changed a few settings based on the docs
<https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/>
*In Mongo*
Data is sharded across 3 servers on the shard key tweet_id, which
distributes it evenly across all shards, and I've created a compound
index on the fields used by the above Drill query.
sample data
{
"_id" : ObjectId("58524d507e08dae4c0377b9e"),
"rule_list" : [
"A",
"B",
"C",
"D13"
],
"actor_friends_count" : 40,
"klout_score" : 28,
"actor_favorites_count" : 1697,
"actor_preferred_username" : "_20_xxxx",
"sentiment" : "neu",
"tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
"object_actor_followers_count" : 573,
"actor_posted_time" : "2016-06-24T00:37:54.000Z",
"actor_id" : "id:twitter.com:xxxxxxxxxx",
"actor_display_name" : "xxxxxx",
"retweet_count" : 57,
"hashtag_list" : [
"c",
"d",
"E",
"f",
"VCVC",
"XXCXCXC",
"RRRR"
],
"body" : "some tweet blah blah",
"actor_followers_count" : 21,
"actor_status_count" : 1756,
"verb" : "share",
"posted_time" : "2016-08-01T23:47:43.000Z",
"object_actor_status_count" : 2796,
"lang" : "ar",
"object_actor_preferred_username" : "xxxxxxx",
"original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
"gender" : "male",
"object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
"favorites_count" : 0,
"object_posted_time" : "2016-08-01T22:54:22.000Z",
"object_actor_friends_count" : 69,
"generator_display_name" : "Twitter Web Client",
"object_actor_display_name" : "xxxxxxxxxxxx",
"actor_listed_count" : 0
}
*Questions*
1) How can I improve aggregation query performance?
2) Do I also need to do something in MongoDB to enhance performance
(i.e., optimize the source)?
3) Is Apache Drill capable of handling/aggregating billions of documents
in real time?
Your early response is highly appreciated!
Thank you & Kind Regards
Yousuf