Hi Kathleen,

A bug has been reported on JIRA: <https://issues.apache.org/jira/browse/DRILL-5139>

Thanks & Regards,
Yousuf

On 12/19/2016 10:03 PM, Kathleen Li wrote:
It seems that in Drill 1.8 the parameter store.mongo.all_text_mode=true is already the default. Try:

ALTER SESSION SET `exec.enable_union_type` = true;

If you still get the errors, you might open a public JIRA with the detailed information. You might also try using CTAS to create Drill tables in Parquet format, then run your queries against the Drill Parquet tables to see if the performance gets better.

Thanks,
Kathleen

On 12/19/16, 10:31 AM, "Kathleen Li" <k...@maprtech.com> wrote:

Hi Yousuf,

Yes, in my env I had set store.mongo.bson.record.reader = true. With the one record you provided, the same query works fine for me; the error you got is a schema-change related error:

0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT hashtag, count(*) as cnt from (select
. . . . . . . . . . . . . . . . . . . . . . .> flatten(hashtag_list) as hashtag from test) group by hashtag order
. . . . . . . . . . . . . . . . . . . . . . .> by cnt desc limit 10;
+----------+------+
| hashtag  | cnt  |
+----------+------+
| RRRR     | 1    |
| VCVC     | 1    |
| XXCXCXC  | 1    |
| c        | 1    |
| d        | 1    |
| f        | 1    |
| E        | 1    |

With that, you might try:

alter session set store.mongo.all_text_mode=true

and rerun the query to see if you can get past the error.

Kathleen

On 12/18/16, 1:21 AM, "yousuf" <yousufud...@css.org.sa> wrote:

Hi Kathleen,

Thanks for responding... I've noticed that when I alter session set store.mongo.bson.record.reader = true; the performance improves. However, the other queries are failing :(.

0: jdbc:drill:> alter session set store.mongo.bson.record.reader = true;
+-------+------------------------------------------+
|  ok   |                 summary                  |
+-------+------------------------------------------+
| true  | store.mongo.bson.record.reader updated.  |
+-------+------------------------------------------+
1 row selected (0.082 seconds)

0: jdbc:drill:> SELECT count(*) as cnt, actor_preferred_username from tweets group by actor_preferred_username order by cnt desc limit 10;
+--------+---------------------------+
|  cnt   | actor_preferred_username  |
+--------+---------------------------+
| 10770  | mrnota53                  |
| 6239   | cyberahsokatano           |
| 4609   | abeerlilak                |
| 4562   | DaeshAjel3                |
| 4523   | Aster__Q                  |
| 4275   | DaeshCrimes               |
| 4182   | AfwfwefOfwefho            |
| 3980   | HewarMaftuh2              |
| 3698   | DaeshAjel                 |
| 3661   | sarosh_iq                 |
+--------+---------------------------+
10 rows selected (4.64 seconds)

*Failed query when store.mongo.bson.record.reader = true:*

0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select flatten(hashtag_list) as hashtag from hashtags) group by hashtag order by cnt desc limit 10;
Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a VarChar type when you are using a ValueWriter of type UnionListWriter.

Fragment 4:1

[Error Id: 278752e2-a959-482c-b4b0-b79ba923f148 on test01.css.org:31010]

(java.lang.IllegalArgumentException) You tried to write a VarChar type when you are using a ValueWriter of type UnionListWriter.
    org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.fail():762
    org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.write():325
    org.apache.drill.exec.vector.complex.impl.UnionListWriter.write():91
    org.apache.drill.exec.store.bson.BsonRecordReader.writeString():275
    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
    org.apache.drill.exec.store.bson.BsonRecordReader.write():75
    org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
    org.apache.drill.exec.physical.impl.ScanBatch.next():178
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext():120
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.buildSchema():97
    org.apache.drill.exec.record.AbstractRecordBatch.next():142
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745
(state=,code=0)

On 12/16/2016 01:55 AM, Kathleen Li wrote:

In my env, the first run took about 1.6s; the second run took only 0.5s:

0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
. . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
. . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
. . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
+------+---------------------------+
| cnt  | actor_preferred_username  |
+------+---------------------------+
| 1    | _20_xxxx                  |
+------+---------------------------+
1 row selected (1.585 seconds)

0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
. . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
. . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
. . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
+------+---------------------------+
| cnt  | actor_preferred_username  |
+------+---------------------------+
| 1    | _20_xxxx                  |
+------+---------------------------+
1 row selected (0.505 seconds)

I am running 4 VM servers, with 4GB heap and 8GB direct memory, but this query only uses one fragment; see the attached Drill profile. As Dechang suggested, you can check the profile to see which step most of the time is spent in.

Kathleen

On 12/15/16, 2:27 PM, "Dechang Gu" <d...@maprtech.com> wrote:

Yousuf,

Which version of Drill are you running? Can you share the profile of the query?

Thanks,
Dechang

On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yousufud...@css.org.sa> wrote:

Hello experts,

As a POC project, I've built a Drill cluster on 5 VMs, each with the following specs:

32 GB RAM
1 TB storage
16 cores

A ZooKeeper quorum and Apache Drill are installed on all 5 nodes. My storage engine is MongoDB, which holds 5 million docs. (Our daily collection is close to 2.5 million tweets.)

*Problem*

My aggregation queries are slow, not fit for my realtime dashboard. The following sample query took 12+ secs:

SELECT count(*) as cnt, actor_preferred_username from tweets where posted_time >= '2016-08-01T00.00.00.000Z' and posted_time <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by cnt desc limit 10;

Pls. note that:

*In Drill*

1) I've changed conf/drill-env.sh on each node with the following:

export DRILL_HEAP=${DRILL_HEAP:-"8G"}
export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}

2) Changed a few settings based on the docs <https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/>

*In Mongo*

Data is sharded on 3 servers on the shard key tweet_id, which distributes data evenly across all shards, and I created a compound index on the fields used by the above Drill query.
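[Editorial note: the compound index mentioned above is not shown in the thread. A minimal mongo-shell sketch of what it might look like, assuming it covers the query's filter and group-by columns; the field choice and order here are an assumption, not the original setup:]

```
// Hypothetical compound index covering the range filter on posted_time
// and the grouping column actor_preferred_username
db.tweets.createIndex({ posted_time: 1, actor_preferred_username: 1 })
```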
Sample data:

{
    "_id" : ObjectId("58524d507e08dae4c0377b9e"),
    "rule_list" : [ "A", "B", "C", "D13" ],
    "actor_friends_count" : 40,
    "klout_score" : 28,
    "actor_favorites_count" : 1697,
    "actor_preferred_username" : "_20_xxxx",
    "sentiment" : "neu",
    "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
    "object_actor_followers_count" : 573,
    "actor_posted_time" : "2016-06-24T00:37:54.000Z",
    "actor_id" : "id:twitter.com:xxxxxxxxxx",
    "actor_display_name" : "xxxxxx",
    "retweet_count" : 57,
    "hashtag_list" : [ "c", "d", "E", "f", "VCVC", "XXCXCXC", "RRRR" ],
    "body" : "some tweet blah blah",
    "actor_followers_count" : 21,
    "actor_status_count" : 1756,
    "verb" : "share",
    "posted_time" : "2016-08-01T23:47:43.000Z",
    "object_actor_status_count" : 2796,
    "lang" : "ar",
    "object_actor_preferred_username" : "xxxxxxx",
    "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
    "gender" : "male",
    "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
    "favorites_count" : 0,
    "object_posted_time" : "2016-08-01T22:54:22.000Z",
    "object_actor_friends_count" : 69,
    "generator_display_name" : "Twitter Web Client",
    "object_actor_display_name" : "xxxxxxxxxxxx",
    "actor_listed_count" : 0
}

*Questions*

1) How can I improve aggregation query performance?
2) Do I also need to do something in MongoDB to enhance performance? (I mean, optimize the source.)
3) Is Apache Drill capable of handling/aggregating billions of documents in real time?

Your early response is highly appreciated!

Thank you & Kind Regards,
Yousuf
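[Editorial note, not part of the original thread: the query that fails in the thread, FLATTEN(hashtag_list) followed by GROUP BY and a count, is logically a per-hashtag count over an array field. A minimal Python sketch of the same computation over documents shaped like the sample above; the documents here are hypothetical, with field names taken from the sample record:]

```python
from collections import Counter

# Hypothetical documents shaped like the tweet records above;
# only the fields relevant to the query are included.
docs = [
    {"actor_preferred_username": "_20_xxxx",
     "hashtag_list": ["c", "d", "E", "f", "VCVC", "XXCXCXC", "RRRR"]},
    {"actor_preferred_username": "other_user",
     "hashtag_list": ["c", "d"]},
]

# Equivalent of:
#   SELECT hashtag, count(*) as cnt
#   FROM (SELECT flatten(hashtag_list) as hashtag FROM hashtags)
#   GROUP BY hashtag ORDER BY cnt DESC LIMIT 10
# The generator flattens every hashtag_list; Counter does the group-by count.
counts = Counter(tag for d in docs for tag in d["hashtag_list"])
top10 = counts.most_common(10)  # descending by count, at most 10 rows
print(top10)
```

This also illustrates why the Drill-side error is schema-related rather than logical: the aggregation itself is trivial; the failure happens while the BSON reader materializes the array values into vectors.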