Hi Yousuf,
Yes, in my env I had set store.mongo.bson.record.reader = true. With the one record you provided, the same query works fine for me; the error you got is a schema-change related error:

0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT hashtag, count(*) as cnt from (select
. . . . . . . . . . . . . . . . . . . . . . .> flatten(hashtag_list) as hashtag from test) group by hashtag order
. . . . . . . . . . . . . . . . . . . . . . .> by cnt desc limit 10;
+----------+------+
| hashtag  | cnt  |
+----------+------+
| RRRR     | 1    |
| VCVC     | 1    |
| XXCXCXC  | 1    |
| c        | 1    |
| d        | 1    |
| f        | 1    |
| E        | 1    |

With that, you might try:

alter session set store.mongo.all_text_mode = true

and rerun the query to see if you can get past the error.
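For example, the full sequence would be something like this (the hashtags table name and the flatten query are taken from your mail below; all_text_mode reads every field back as VarChar, so it should sidestep the VarChar/UnionListWriter schema-change error, at the cost of needing explicit CASTs on numeric fields):

-- read all Mongo/BSON fields as text so mixed-type list elements
-- should no longer trip the UnionListWriter/VarChar error
alter session set store.mongo.all_text_mode = true;

-- then rerun the failing aggregation
SELECT hashtag, count(*) as cnt
FROM (SELECT flatten(hashtag_list) as hashtag FROM hashtags)
GROUP BY hashtag
ORDER BY cnt desc
LIMIT 10;
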
Kathleen

On 12/18/16, 1:21 AM, "yousuf" <yousufud...@css.org.sa> wrote:

>Hi Kathleen,
>
>Thanks for responding...
>
>I've noticed that when alter session set store.mongo.bson.record.reader =
>true; is set, the performance improves. However, the other queries are
>failing :(.
>
>
>0: jdbc:drill:> alter session set store.mongo.bson.record.reader = true;
>+-------+------------------------------------------+
>|  ok   |                 summary                  |
>+-------+------------------------------------------+
>| true  | store.mongo.bson.record.reader updated.  |
>+-------+------------------------------------------+
>1 row selected (0.082 seconds)
>0: jdbc:drill:> SELECT count(*) as cnt, actor_preferred_username from
>tweets group by actor_preferred_username order by cnt desc limit 10;
>+--------+---------------------------+
>|  cnt   | actor_preferred_username  |
>+--------+---------------------------+
>| 10770  | mrnota53                  |
>| 6239   | cyberahsokatano           |
>| 4609   | abeerlilak                |
>| 4562   | DaeshAjel3                |
>| 4523   | Aster__Q                  |
>| 4275   | DaeshCrimes               |
>| 4182   | AfwfwefOfwefho            |
>| 3980   | HewarMaftuh2              |
>| 3698   | DaeshAjel                 |
>| 3661   | sarosh_iq                 |
>+--------+---------------------------+
>10 rows selected (4.64 seconds)
>
>
>*Failed query when store.mongo.bson.record.reader = true:*
>
>0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select
>flatten(hashtag_list) as hashtag from hashtags) group by hashtag order
>by cnt desc limit 10;
>Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a
>VarChar type when you are using a ValueWriter of type UnionListWriter.
>
>Fragment 4:1
>
>[Error Id: 278752e2-a959-482c-b4b0-b79ba923f148 on test01.css.org:31010]
>
>  (java.lang.IllegalArgumentException) You tried to write a VarChar
>type when you are using a ValueWriter of type UnionListWriter.
>    org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.fail():762
>    org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.write():325
>    org.apache.drill.exec.vector.complex.impl.UnionListWriter.write():91
>    org.apache.drill.exec.store.bson.BsonRecordReader.writeString():275
>    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
>    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
>    org.apache.drill.exec.store.bson.BsonRecordReader.write():75
>    org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
>    org.apache.drill.exec.physical.impl.ScanBatch.next():178
>    org.apache.drill.exec.record.AbstractRecordBatch.next():119
>    org.apache.drill.exec.record.AbstractRecordBatch.next():109
>    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>    org.apache.drill.exec.record.AbstractRecordBatch.next():162
>    org.apache.drill.exec.record.AbstractRecordBatch.next():119
>    org.apache.drill.exec.record.AbstractRecordBatch.next():109
>    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>    org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext():120
>    org.apache.drill.exec.record.AbstractRecordBatch.next():162
>    org.apache.drill.exec.record.AbstractRecordBatch.next():119
>    org.apache.drill.exec.record.AbstractRecordBatch.next():109
>    org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.buildSchema():97
>    org.apache.drill.exec.record.AbstractRecordBatch.next():142
>    org.apache.drill.exec.record.AbstractRecordBatch.next():119
>    org.apache.drill.exec.record.AbstractRecordBatch.next():109
>    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>    org.apache.drill.exec.record.AbstractRecordBatch.next():162
>    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>    org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
>    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
>    java.security.AccessController.doPrivileged():-2
>    javax.security.auth.Subject.doAs():422
>    org.apache.hadoop.security.UserGroupInformation.doAs():1657
>    org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
>    org.apache.drill.common.SelfCleaningRunnable.run():38
>    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>    java.lang.Thread.run():745 (state=,code=0)
>
>
>On 12/16/2016 01:55 AM, Kathleen Li wrote:
>> In my env, the first time took about 1.6s, the second time only took 0.5s:
>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt,
>> actor_preferred_username from test where
>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >=
>> '2016-08-01T00.00.00.000Z' and posted_time
>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z'
>> group by actor_preferred_username order by
>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>> +------+---------------------------+
>> | cnt  | actor_preferred_username  |
>> +------+---------------------------+
>> | 1    | _20_xxxx                  |
>> +------+---------------------------+
>> 1 row selected (1.585 seconds)
>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt,
>> actor_preferred_username from test where
>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >=
>> '2016-08-01T00.00.00.000Z' and posted_time
>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z'
>> group by actor_preferred_username order by
>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>> +------+---------------------------+
>> | cnt  | actor_preferred_username  |
>> +------+---------------------------+
>> | 1    | _20_xxxx                  |
>> +------+---------------------------+
>> 1 row selected (0.505 seconds)
>>
>> I am running 4 VM servers, with 4GB heap and 8GB direct memory. But this
>> query is only using one fragment; see the attached drill profile.
>>
>> As Dechang suggested, you can check the profile and see which step most of
>> the time is spent on.
>>
>> Kathleen
>>
>>
>> On 12/15/16, 2:27 PM, "Dechang Gu" <d...@maprtech.com> wrote:
>>
>>> Yousuf,
>>> Which version of Drill are you running?
>>> Can you share the profile of the query?
>>>
>>> Thanks,
>>> Dechang
>>>
>>>
>>> On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yousufud...@css.org.sa> wrote:
>>>
>>>> Hello experts,
>>>>
>>>> As a POC project, I've built a Drill cluster on 5 VMs, each with the
>>>> following specs:
>>>>
>>>> 32 GB RAM
>>>>
>>>> 1 TB storage
>>>>
>>>> 16 cores
>>>>
>>>> A ZooKeeper quorum and Apache Drill are installed on all 5 nodes. My
>>>> storage engine is Mongo, which has 5 million docs. (Our daily collection
>>>> is close to 2.5 million tweets.)
>>>>
>>>> *Problem*
>>>>
>>>> My aggregation queries are slow and not fit for my real-time dashboard.
>>>> The sample query below took 12+ secs:
>>>>
>>>> SELECT count(*) as cnt, actor_preferred_username from tweets where
>>>> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>>> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
>>>> cnt desc limit 10;
>>>>
>>>> Please note that:
>>>>
>>>> *In Drill*
>>>>
>>>> 1) I've changed conf/drill-env.sh on each node with the following:
>>>>
>>>> export DRILL_HEAP=${DRILL_HEAP:-"8G"}
>>>> export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}
>>>>
>>>> 2) Changed a few settings based on the docs
>>>> <https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/>.
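>>>> For example, the options on that page can be set per session along these
>>>> lines (the value below is only illustrative, not necessarily the exact
>>>> one I used):
>>>>
>>>> -- give the sort/hash operators more memory per query per node
>>>> alter session set planner.memory.max_query_memory_per_node = 4294967296;
>>>> -- that page also describes forcing the sort-based aggregate instead of
>>>> -- the hash aggregate, e.g.:
>>>> -- alter session set planner.enable_hashagg = false;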
>>>>
>>>> *In Mongo*
>>>>
>>>> Data is sharded on 3 servers on the shard key tweet_id, which distributes
>>>> data evenly across all shards, and I created a compound index on the
>>>> fields used by the above drill query.
>>>>
>>>> Sample data:
>>>>
>>>> {
>>>>     "_id" : ObjectId("58524d507e08dae4c0377b9e"),
>>>>     "rule_list" : [
>>>>         "A",
>>>>         "B",
>>>>         "C",
>>>>         "D13"
>>>>     ],
>>>>     "actor_friends_count" : 40,
>>>>     "klout_score" : 28,
>>>>     "actor_favorites_count" : 1697,
>>>>     "actor_preferred_username" : "_20_xxxx",
>>>>     "sentiment" : "neu",
>>>>     "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
>>>>     "object_actor_followers_count" : 573,
>>>>     "actor_posted_time" : "2016-06-24T00:37:54.000Z",
>>>>     "actor_id" : "id:twitter.com:xxxxxxxxxx",
>>>>     "actor_display_name" : "xxxxxx",
>>>>     "retweet_count" : 57,
>>>>     "hashtag_list" : [
>>>>         "c",
>>>>         "d",
>>>>         "E",
>>>>         "f",
>>>>         "VCVC",
>>>>         "XXCXCXC",
>>>>         "RRRR"
>>>>     ],
>>>>     "body" : "some tweet blah blah",
>>>>     "actor_followers_count" : 21,
>>>>     "actor_status_count" : 1756,
>>>>     "verb" : "share",
>>>>     "posted_time" : "2016-08-01T23:47:43.000Z",
>>>>     "object_actor_status_count" : 2796,
>>>>     "lang" : "ar",
>>>>     "object_actor_preferred_username" : "xxxxxxx",
>>>>     "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
>>>>     "gender" : "male",
>>>>     "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
>>>>     "favorites_count" : 0,
>>>>     "object_posted_time" : "2016-08-01T22:54:22.000Z",
>>>>     "object_actor_friends_count" : 69,
>>>>     "generator_display_name" : "Twitter Web Client",
>>>>     "object_actor_display_name" : "xxxxxxxxxxxx",
>>>>     "actor_listed_count" : 0
>>>> }
>>>>
>>>> *Questions*
>>>>
>>>> 1) How can I improve aggregation query performance?
>>>>
>>>> 2) Do I also need to do something in MongoDB to enhance performance?
>>>> (I mean optimize the source.)
>>>>
>>>> 3) Is Apache Drill capable of handling/aggregating billions of documents
>>>> in real time?
>>>>
>>>> Your early response is highly appreciated!
>>>>
>>>> Thank you & Kind Regards
>>>>
>>>> Yousuf