Hi Kathleen,

A bug has been reported on JIRA: <https://issues.apache.org/jira/browse/DRILL-5139>

Thanks & Regards,
Yousuf

On 12/19/2016 10:03 PM, Kathleen Li wrote:
It seems that in Drill 1.8 the parameter store.mongo.all_text_mode=true is already the default. Try:

ALTER SESSION SET `exec.enable_union_type` = true;

If you still get the errors, you might open a public JIRA with the detailed information. You might also try using CTAS to create Drill tables in Parquet format, then run your queries against the Drill Parquet tables to see if the performance gets better.

Thanks,
Kathleen

On 12/19/16, 10:31 AM, "Kathleen Li" <k...@maprtech.com> wrote:

Hi Yousuf,

Yes, in my env I had set store.mongo.bson.record.reader = true. With the one record you provided, the same query works fine for me; the error you got is a schema-change related error:

0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT hashtag, count(*) as cnt from (select
. . . . . . . . . . . . . . . . . . . . . . .> flatten(hashtag_list) as hashtag from test) group by hashtag order
. . . . . . . . . . . . . . . . . . . . . . .> by cnt desc limit 10;
+----------+------+
| hashtag  | cnt  |
+----------+------+
| RRRR     | 1    |
| VCVC     | 1    |
| XXCXCXC  | 1    |
| c        | 1    |
| d        | 1    |
| f        | 1    |
| E        | 1    |

With that, you might try:

alter session set store.mongo.all_text_mode=true

and rerun the query to see if you can get past the error.

Kathleen

On 12/18/16, 1:21 AM, "yousuf" <yousufud...@css.org.sa> wrote:

Hi Kathleen,

Thanks for responding... I've noticed that when I alter session set store.mongo.bson.record.reader = true; the performance improves. However, the other queries are failing :(.

0: jdbc:drill:> alter session set store.mongo.bson.record.reader = true;
+-------+------------------------------------------+
|  ok   |                 summary                  |
+-------+------------------------------------------+
| true  | store.mongo.bson.record.reader updated.  |
+-------+------------------------------------------+
1 row selected (0.082 seconds)

0: jdbc:drill:> SELECT count(*) as cnt, actor_preferred_username from tweets group by actor_preferred_username order by cnt desc limit 10;
+--------+---------------------------+
|  cnt   | actor_preferred_username  |
+--------+---------------------------+
| 10770  | mrnota53                  |
| 6239   | cyberahsokatano           |
| 4609   | abeerlilak                |
| 4562   | DaeshAjel3                |
| 4523   | Aster__Q                  |
| 4275   | DaeshCrimes               |
| 4182   | AfwfwefOfwefho            |
| 3980   | HewarMaftuh2              |
| 3698   | DaeshAjel                 |
| 3661   | sarosh_iq                 |
+--------+---------------------------+
10 rows selected (4.64 seconds)

*Failed query when store.mongo.bson.record.reader = true:*

0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select flatten(hashtag_list) as hashtag from hashtags) group by hashtag order by cnt desc limit 10;
Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a VarChar type when you are using a ValueWriter of type UnionListWriter.

Fragment 4:1

[Error Id: 278752e2-a959-482c-b4b0-b79ba923f148 on test01.css.org:31010]

(java.lang.IllegalArgumentException) You tried to write a VarChar type when you are using a ValueWriter of type UnionListWriter.
    org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.fail():762
    org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.write():325
    org.apache.drill.exec.vector.complex.impl.UnionListWriter.write():91
    org.apache.drill.exec.store.bson.BsonRecordReader.writeString():275
    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
    org.apache.drill.exec.store.bson.BsonRecordReader.write():75
    org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
    org.apache.drill.exec.physical.impl.ScanBatch.next():178
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext():120
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.buildSchema():97
    org.apache.drill.exec.record.AbstractRecordBatch.next():142
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745
(state=,code=0)

On 12/16/2016 01:55 AM, Kathleen Li wrote:

In my env, the first run took about 1.6s; the second run took only 0.5s:

0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
. . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
. . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
. . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
+------+---------------------------+
| cnt  | actor_preferred_username  |
+------+---------------------------+
| 1    | _20_xxxx                  |
+------+---------------------------+
1 row selected (1.585 seconds)

0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
. . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
. . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
. . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
+------+---------------------------+
| cnt  | actor_preferred_username  |
+------+---------------------------+
| 1    | _20_xxxx                  |
+------+---------------------------+
1 row selected (0.505 seconds)

I am running 4 VM servers, with 4GB heap and 8GB direct memory, but this query only uses one fragment; see the attached Drill profile. As Dechang suggested, you can check the profile to see which step most of the time is spent in.

Kathleen

On 12/15/16, 2:27 PM, "Dechang Gu" <d...@maprtech.com> wrote:

Yousuf,

Which version of Drill are you running? Can you share the profile of the query?

Thanks,
Dechang

On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yousufud...@css.org.sa> wrote:

Hello experts,

As a POC project, I've built a Drill cluster on 5 VMs, each with the following specs:

32 GB RAM
1 TB storage
16 cores

A ZooKeeper quorum and Apache Drill are installed on all 5 nodes. My storage engine is MongoDB, which holds 5 million docs. (Our daily collection is close to 2.5 million tweets.)

*Problem*

My aggregation queries are slow, not fit for my realtime dashboard. The following sample query took 12+ secs:

SELECT count(*) as cnt, actor_preferred_username from tweets where posted_time >= '2016-08-01T00.00.00.000Z' and posted_time <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by cnt desc limit 10;

Pls. note that:

*In Drill*

1) I've changed conf/drill-env.sh on each node with the following:

export DRILL_HEAP=${DRILL_HEAP:-"8G"}
export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}

2) Changed a few settings based on the docs <https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/>

*In Mongo*

Data is sharded on 3 servers on the shard key tweet_id, which distributes data evenly across all shards, and I created a compound index on the fields used by the above Drill query.
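[Editorial note: the compound index mentioned above is not shown in the thread. A minimal mongo-shell sketch of what it might look like, assuming it covers the query's filter and group-by columns; the field choice and order here are an assumption, not the original setup:]

```
// Hypothetical compound index covering the range filter on posted_time
// and the grouping column actor_preferred_username
db.tweets.createIndex({ posted_time: 1, actor_preferred_username: 1 })
```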
Sample data:

{
    "_id" : ObjectId("58524d507e08dae4c0377b9e"),
    "rule_list" : [ "A", "B", "C", "D13" ],
    "actor_friends_count" : 40,
    "klout_score" : 28,
    "actor_favorites_count" : 1697,
    "actor_preferred_username" : "_20_xxxx",
    "sentiment" : "neu",
    "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
    "object_actor_followers_count" : 573,
    "actor_posted_time" : "2016-06-24T00:37:54.000Z",
    "actor_id" : "id:twitter.com:xxxxxxxxxx",
    "actor_display_name" : "xxxxxx",
    "retweet_count" : 57,
    "hashtag_list" : [ "c", "d", "E", "f", "VCVC", "XXCXCXC", "RRRR" ],
    "body" : "some tweet blah blah",
    "actor_followers_count" : 21,
    "actor_status_count" : 1756,
    "verb" : "share",
    "posted_time" : "2016-08-01T23:47:43.000Z",
    "object_actor_status_count" : 2796,
    "lang" : "ar",
    "object_actor_preferred_username" : "xxxxxxx",
    "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
    "gender" : "male",
    "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
    "favorites_count" : 0,
    "object_posted_time" : "2016-08-01T22:54:22.000Z",
    "object_actor_friends_count" : 69,
    "generator_display_name" : "Twitter Web Client",
    "object_actor_display_name" : "xxxxxxxxxxxx",
    "actor_listed_count" : 0
}

*Questions*

1) How can I improve aggregation query performance?
2) Do I also need to do something in MongoDB to enhance performance? (I mean, optimize the source.)
3) Is Apache Drill capable of handling/aggregating billions of documents in real time?

Your early response is highly appreciated!

Thank you & Kind Regards,
Yousuf
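[Editorial note, not part of the original thread: the query that fails in the thread, FLATTEN(hashtag_list) followed by GROUP BY and a count, is logically a per-hashtag count over an array field. A minimal Python sketch of the same computation over documents shaped like the sample above; the documents here are hypothetical, with field names taken from the sample record:]

```python
from collections import Counter

# Hypothetical documents shaped like the tweet records above;
# only the fields relevant to the query are included.
docs = [
    {"actor_preferred_username": "_20_xxxx",
     "hashtag_list": ["c", "d", "E", "f", "VCVC", "XXCXCXC", "RRRR"]},
    {"actor_preferred_username": "other_user",
     "hashtag_list": ["c", "d"]},
]

# Equivalent of:
#   SELECT hashtag, count(*) as cnt
#   FROM (SELECT flatten(hashtag_list) as hashtag FROM hashtags)
#   GROUP BY hashtag ORDER BY cnt DESC LIMIT 10
# The generator flattens every hashtag_list; Counter does the group-by count.
counts = Counter(tag for d in docs for tag in d["hashtag_list"])
top10 = counts.most_common(10)  # descending by count, at most 10 rows
print(top10)
```

This also illustrates why the Drill-side error is schema-related rather than logical: the aggregation itself is trivial; the failure happens while the BSON reader materializes the array values into vectors.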