It is supposed to work the way you expected. Maybe you are running into a bug. Why is it reading all the files after the metadata refresh? That is difficult to answer without looking at the logs and the query profile. If you look at the query profile, you can check what the usedMetadataFile flag says for the scan.

Also, if you created that many files, your metadata cache file could be big. Maybe you can manually sanity check that it looks OK and is not corrupted (look for the .drill.parquet_metadata file in the root directory).
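A minimal sketch of such a sanity check in Python, assuming the cache file is JSON and has been copied to (or is mounted on) a local path. The top-level `files` key and the default file name are assumptions about the cache layout and may differ between Drill versions; `check_metadata_cache` is a hypothetical helper, not part of Drill:

```python
import json
import os


def check_metadata_cache(table_dir, cache_name=".drill.parquet_metadata"):
    """Summarise the metadata cache file, or raise if it is missing or unreadable."""
    path = os.path.join(table_dir, cache_name)
    size_bytes = os.path.getsize(path)  # raises OSError if the file is missing
    with open(path, "r", encoding="utf-8") as fh:
        doc = json.load(fh)  # raises json.JSONDecodeError if truncated or corrupted
    # Assumption: the cache lists parquet files under a top-level "files" key.
    files = doc.get("files", []) if isinstance(doc, dict) else []
    return {"path": path, "size_bytes": size_bytes, "file_entries": len(files)}
```

If `json.load` raises, the file is likely truncated or corrupted. If `file_entries` is far from the number of parquet files in the table, the cache may be stale or the key name assumption may not hold for your version.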
Thanks,
Padma

On Aug 17, 2017, at 8:10 PM, Khurram Faraaz <kfar...@mapr.com> wrote:

Please share your SQL query and the query plan. To get the query plan, execute EXPLAIN PLAN FOR <your-SQL-query>;

Thanks,
Khurram

________________________________
From: Divya Gehlot <divya.htco...@gmail.com>
Sent: Friday, August 18, 2017 7:15:18 AM
To: user@drill.apache.org
Subject: Re: Query Optimization

Hi,

Yes, it's the same query; I just ran the metadata refresh command. My understanding is that the metadata refresh command saves reading the metadata. How about the column values? Why is it reading all the files after the metadata refresh?

Partitioning helps to retrieve data faster. In Hive, when you mention the partition column in the where condition, it just goes and reads those partitions, which improves the query performance. In my query, too, the where condition has the partitioning column, so it should go and read only those partitioned files, right? Why is it taking more time? Does Drill work in a different way compared to Hive?

Thanks,
Divya

On 18 August 2017 at 07:37, Padma Penumarthy <ppenumar...@mapr.com> wrote:

It might read all those files if some new data gets added after running refresh metadata cache. If everything is the same before and after the metadata refresh, i.e. no new data added and the query is exactly the same, then it should not do that. Also, check if you can partition in a way that will not create so many files in the first place.
Thanks,
Padma

On Aug 16, 2017, at 10:54 PM, Divya Gehlot <divya.htco...@gmail.com> wrote:

Hi,

Another observation: my query had where conditions based on the partition values.

Total number of parquet files in directory - 102290
Before metadata refresh - it's reading only 4 files
After metadata refresh - it's reading 102290 files

Is this how the metadata refresh works? I mean, does it scan each and every file to get the results? I don't have access to the logs now.

Thanks,
Divya

On 17 August 2017 at 13:48, Divya Gehlot <divya.htco...@gmail.com> wrote:

Hi,

Another observation: my query had where conditions based on the partition values.

Before metadata refresh - it's reading only 4 files
After metadata refresh - it's reading 102290 files

Thanks,
Divya

On 17 August 2017 at 13:03, Padma Penumarthy <ppenumar...@mapr.com> wrote:

Does your query have a partition filter? The execution time most likely increased because partition pruning is not happening. Did you get a chance to look at the logs? That might give some clues.

Thanks,
Padma

On Aug 16, 2017, at 9:32 PM, Divya Gehlot <divya.htco...@gmail.com> wrote:

Hi,

Even I am surprised. I am running Drill version 1.10 on MapR enterprise version.

*Query* - selecting all the columns on a partitioned parquet table.

I observed a few things from the query statistics:

Value       Before Refresh Metadata   After Refresh Metadata
Fragments   1                         13
DURATION    01 min 0.233 sec          18 min 0.744 sec
PLANNING    59.818 sec                33.087 sec
QUEUED      Not Available             Not Available
EXECUTION   0.415 sec                 17 min 27.657 sec

The planning time is reduced by approx 45%, but the execution time increased drastically. I would like to understand why the execution time increases after the metadata refresh.

Appreciate the help.
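For reference, the figures in the statistics above can be compared with a few lines of Python. This is only a sketch of the arithmetic; `to_seconds` is a hypothetical helper for the duration strings shown in the profile, not part of Drill. On these numbers the planning time drops by roughly 45%, while the execution time grows by a factor of a few thousand.

```python
import re


def to_seconds(text):
    """Convert a profile duration such as '18 min 0.744 sec' or '59.818 sec' to seconds."""
    match = re.fullmatch(r"(?:(\d+)\s*min\s+)?(\d+(?:\.\d+)?)\s*sec", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes = int(match.group(1) or 0)  # the minutes part is optional
    return minutes * 60 + float(match.group(2))


# Values copied from the statistics in this thread.
planning_before = to_seconds("59.818 sec")
planning_after = to_seconds("33.087 sec")
execution_before = to_seconds("0.415 sec")
execution_after = to_seconds("17 min 27.657 sec")

planning_reduction_pct = 100 * (planning_before - planning_after) / planning_before
execution_ratio = execution_after / execution_before

print(f"planning reduced by {planning_reduction_pct:.1f}%")
print(f"execution slower by a factor of {execution_ratio:.0f}")
```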
Thanks,
Divya

On 17 August 2017 at 11:54, Padma Penumarthy <ppenumar...@mapr.com> wrote:

Refresh table metadata should help reduce query planning time. It is odd that it went up after you did refresh table metadata. Did you check the logs to see what is happening? You might have to turn on debug logging if needed. BTW, what version of Drill are you running?

Thanks,
Padma

On Aug 16, 2017, at 8:15 PM, Divya Gehlot <divya.htco...@gmail.com> wrote:

Hi,

I have data in parquet file format. When I run the query on the data and look at the execution plan, I see the following statistics:

TOTAL FRAGMENTS: 1
DURATION: 01 min 0.233 sec
PLANNING: 59.818 sec
QUEUED: Not Available
EXECUTION: 0.415 sec

As it's a parquet file format, I tried enabling the metadata refresh and ran the command below:

REFRESH TABLE METADATA <path to table>;

Then I ran the same query again on the same table with the same data (no changes in data) and found the statistics shown below:

TOTAL FRAGMENTS: 13
DURATION: 14 min 14.604 sec
PLANNING: 33.087 sec
QUEUED: Not Available
EXECUTION: Not Available

The query is still running. Can somebody help me understand why the query is taking so long once I issue the refresh metadata command?

Appreciate the help!

Thanks,
Divya