Hi Kathleen,
Thanks for responding...
I've noticed that when I set alter session set store.mongo.bson.record.reader =
true; performance improves. However, other queries now fail :(.
0: jdbc:drill:> alter session set store.mongo.bson.record.reader = true;
+-------+------------------------------------------+
| ok | summary |
+-------+------------------------------------------+
| true | store.mongo.bson.record.reader updated. |
+-------+------------------------------------------+
1 row selected (0.082 seconds)
0: jdbc:drill:> SELECT count(*) as cnt, actor_preferred_username from
tweets group by actor_preferred_username order by cnt desc limit 10;
+--------+---------------------------+
| cnt | actor_preferred_username |
+--------+---------------------------+
| 10770 | mrnota53 |
| 6239 | cyberahsokatano |
| 4609 | abeerlilak |
| 4562 | DaeshAjel3 |
| 4523 | Aster__Q |
| 4275 | DaeshCrimes |
| 4182 | AfwfwefOfwefho |
| 3980 | HewarMaftuh2 |
| 3698 | DaeshAjel |
| 3661 | sarosh_iq |
+--------+---------------------------+
10 rows selected (4.64 seconds)
Failed query when store.mongo.bson.record.reader = true:
0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select
flatten(hashtag_list) as hashtag from hashtags) group by hashtag order
by cnt desc limit 10;
Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a
VarChar type when you are using a ValueWriter of type UnionListWriter.
Fragment 4:1
[Error Id: 278752e2-a959-482c-b4b0-b79ba923f148 on test01.css.org:31010]
(java.lang.IllegalArgumentException) You tried to write a VarChar
type when you are using a ValueWriter of type UnionListWriter.
org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.fail():762
org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.write():325
org.apache.drill.exec.vector.complex.impl.UnionListWriter.write():91
org.apache.drill.exec.store.bson.BsonRecordReader.writeString():275
org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
org.apache.drill.exec.store.bson.BsonRecordReader.write():75
org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
org.apache.drill.exec.physical.impl.ScanBatch.next():178
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext():120
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.buildSchema():97
org.apache.drill.exec.record.AbstractRecordBatch.next():142
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104
org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():745 (state=,code=0)
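Since the trace shows the failure inside the BSON reader's list handling (BsonRecordReader.writeToListOrMap), one possible workaround while this is investigated is to toggle the option off in the same session just for queries that flatten array fields. This is only a sketch, assuming the default reader still handles these documents as it did before the option was enabled:

```sql
-- Sketch of a per-session workaround: disable the BSON reader only for
-- queries that flatten array fields, since those hit the UnionListWriter error.
ALTER SESSION SET `store.mongo.bson.record.reader` = false;

SELECT hashtag, count(*) AS cnt
FROM (SELECT flatten(hashtag_list) AS hashtag FROM hashtags)
GROUP BY hashtag
ORDER BY cnt DESC
LIMIT 10;

-- Re-enable it afterwards for the aggregation queries that benefit from it.
ALTER SESSION SET `store.mongo.bson.record.reader` = true;
```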
On 12/16/2016 01:55 AM, Kathleen Li wrote:
In my environment, the first run took about 1.6s; the second took only 0.5s:
0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt,
actor_preferred_username from test where
. . . . . . . . . . . . . . . . . . . . . . .> posted_time >=
'2016-08-01T00.00.00.000Z' and posted_time
. . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z'
group by actor_preferred_username order by
. . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
+------+---------------------------+
| cnt | actor_preferred_username |
+------+---------------------------+
| 1 | _20_xxxx |
+------+---------------------------+
1 row selected (1.585 seconds)
0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt,
actor_preferred_username from test where
. . . . . . . . . . . . . . . . . . . . . . .> posted_time >=
'2016-08-01T00.00.00.000Z' and posted_time
. . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z'
group by actor_preferred_username order by
. . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
+------+---------------------------+
| cnt | actor_preferred_username |
+------+---------------------------+
| 1 | _20_xxxx |
+------+---------------------------+
1 row selected (0.505 seconds)
I am running 4 VM servers with 4GB heap and 8GB direct memory, but this
query uses only one fragment; see the attached Drill profile.
As Dechang suggested, you can check the profile to see which step most
of the time is spent on.
Kathleen
On 12/15/16, 2:27 PM, "Dechang Gu" <d...@maprtech.com> wrote:
Yousuf,
Which version of Drill are you running?
Can you share the profile of the query?
Thanks,
Dechang
On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yousufud...@css.org.sa> wrote:
Hello experts
As a POC project, I've built a Drill cluster on 5 VMs, each with the
following specs:
32 GB ram
1 TB storage
16 Cores
A ZooKeeper quorum and Apache Drill are installed on all 5 nodes. My
storage engine is MongoDB, which holds 5 million docs. (Our daily
collection is close to 2.5 million tweets.)
*Problem*
My aggregation queries are too slow for my real-time dashboard. The
following sample query took 12+ seconds:
SELECT count(*) as cnt, actor_preferred_username from tweets where
posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
<='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
cnt desc limit 10;
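Before tuning memory or MongoDB, it may help to see where the time actually goes. A sketch using Drill's EXPLAIN, which shows whether the WHERE clause is pushed down into the Mongo scan or applied by Drill after reading every document:

```sql
-- Inspect the physical plan: if the posted_time filter does not appear
-- inside the Mongo scan operator, Drill is reading all documents and
-- filtering them itself, which would explain the 12+ second runtime.
EXPLAIN PLAN FOR
SELECT count(*) AS cnt, actor_preferred_username
FROM tweets
WHERE posted_time >= '2016-08-01T00.00.00.000Z'
  AND posted_time <= '2016-08-10T00.00.00.000Z'
GROUP BY actor_preferred_username
ORDER BY cnt DESC
LIMIT 10;
```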
Please note that:
*In Drill*
1) I've changed conf/drill-env.sh on each node with the following:
export DRILL_HEAP=${DRILL_HEAP:-"8G"}
export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}
2) Changed a few settings based on the docs
<https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/>
*In Mongo*
Data is sharded across 3 servers on the shard key tweet_id, which
distributes it evenly across all shards, and I've created a compound
index on the fields used by the above Drill query.
sample data
{
"_id" : ObjectId("58524d507e08dae4c0377b9e"),
"rule_list" : [
"A",
"B",
"C",
"D13"
],
"actor_friends_count" : 40,
"klout_score" : 28,
"actor_favorites_count" : 1697,
"actor_preferred_username" : "_20_xxxx",
"sentiment" : "neu",
"tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
"object_actor_followers_count" : 573,
"actor_posted_time" : "2016-06-24T00:37:54.000Z",
"actor_id" : "id:twitter.com:xxxxxxxxxx",
"actor_display_name" : "xxxxxx",
"retweet_count" : 57,
"hashtag_list" : [
"c",
"d",
"E",
"f",
"VCVC",
"XXCXCXC",
"RRRR"
],
"body" : "some tweet blah blah",
"actor_followers_count" : 21,
"actor_status_count" : 1756,
"verb" : "share",
"posted_time" : "2016-08-01T23:47:43.000Z",
"object_actor_status_count" : 2796,
"lang" : "ar",
"object_actor_preferred_username" : "xxxxxxx",
"original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
"gender" : "male",
"object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
"favorites_count" : 0,
"object_posted_time" : "2016-08-01T22:54:22.000Z",
"object_actor_friends_count" : 69,
"generator_display_name" : "Twitter Web Client",
"object_actor_display_name" : "xxxxxxxxxxxx",
"actor_listed_count" : 0
}
*Questions*
1) How can I improve aggregation query performance?
2) Do I also need to do something in MongoDB to enhance performance
(i.e., optimize the source)?
3) Is Apache Drill capable of handling/aggregating billions of documents
in real time?
Your early response is highly appreciated!
Thank you & Kind Regards
Yousuf