[ https://issues.apache.org/jira/browse/DRILL-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069961#comment-16069961 ]
ASF GitHub Bot commented on DRILL-4720:
---------------------------------------

GitHub user arina-ielchiieva opened a pull request:

    https://github.com/apache/drill/pull/864

    DRILL-4720: Fix SchemaPartitionExplorer.getSubPartitions method implementations to return only Drill file system directories

    1. Added file system util helper classes to standardize list directory and file statuses usage in Drill with appropriate unit tests.
    2. Fixed SchemaPartitionExplorer.getSubPartitions method implementations to return only directories that can be partitions according to Drill file system rules (excluded all files and directories that start with dot or underscore).
    3. Added unit test for directory explorers UDFs with and without metadata cache presence.
    4. Minor refactoring.

    Details in Jira [DRILL-4720](https://issues.apache.org/jira/browse/DRILL-4720).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/arina-ielchiieva/drill DRILL-4720

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/864.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #864

----
commit 6d592373c740fd793ed6bbb3264b97b52e4b763b
Author: Arina Ielchiieva <arina.yelchiy...@gmail.com>
Date:   2017-06-29T13:08:33Z

    DRILL-4720: Fix SchemaPartitionExplorer.getSubPartitions method implementations to return only Drill file system directories

    1. Added file system util helper classes to standardize list directory and file statuses usage in Drill with appropriate unit tests.
    2. Fixed SchemaPartitionExplorer.getSubPartitions method implementations to return only directories that can be partitions according to Drill file system rules (excluded all files and directories that start with dot or underscore).
    3. Added unit test for directory explorers UDFs with and without metadata cache presence.
    4. Minor refactoring.
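To make item 2 concrete, here is a minimal, self-contained sketch of the rule it describes: only directories whose names do not start with a dot or an underscore count as partitions. It is not the code from the pull request. The class `PartitionDirSketch`, the `Entry` type, and all method names are hypothetical, and plain strings stand in for Hadoop file statuses. The sample listing mirrors the `/tmp/querylogs_4` listing quoted in the issue. Without filtering, the lexicographically smallest name is the metadata cache file, which is consistent with the `'.drill.parquet_metadata'` literal in the reported filter condition.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class PartitionDirSketch {

    // Hypothetical stand-in for a file status: a name plus a directory flag.
    static final class Entry {
        final String name;
        final boolean isDirectory;

        Entry(String name, boolean isDirectory) {
            this.name = name;
            this.isDirectory = isDirectory;
        }
    }

    // Rule from the pull request description: an entry can be a partition only
    // if it is a directory and its name does not start with '.' or '_'.
    static boolean canBePartition(Entry e) {
        return e.isDirectory && !e.name.startsWith(".") && !e.name.startsWith("_");
    }

    // Smallest name with no filtering: files and hidden entries take part.
    static Optional<String> minUnfiltered(List<Entry> entries) {
        return entries.stream().map(e -> e.name).min(String::compareTo);
    }

    // Smallest name among entries that can be partitions.
    static Optional<String> minPartition(List<Entry> entries) {
        return entries.stream()
                .filter(PartitionDirSketch::canBePartition)
                .map(e -> e.name)
                .min(String::compareTo);
    }

    // Mirrors the /tmp/querylogs_4 listing from the issue description.
    static List<Entry> sampleListing() {
        return Arrays.asList(
                new Entry(".drill.parquet_metadata", false),
                new Entry("1985", true),
                new Entry("1999", true),
                new Entry("2005", true),
                new Entry("2014", true),
                new Entry("2016", true));
    }

    public static void main(String[] args) {
        List<Entry> listing = sampleListing();
        // '.' sorts before any digit, so the cache file wins when nothing is filtered.
        System.out.println(minUnfiltered(listing).orElse("<none>")); // .drill.parquet_metadata
        System.out.println(minPartition(listing).orElse("<none>"));  // 1985
    }
}
```

No `dir0` value can ever equal a file name such as `.drill.parquet_metadata`, so a comparison against the unfiltered minimum matches no rows. That agrees with the empty result reported in the issue.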
----

> MINDIR() and IMINDIR() functions return no results with metadata cache
> ----------------------------------------------------------------------
>
>                 Key: DRILL-4720
>                 URL: https://issues.apache.org/jira/browse/DRILL-4720
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.7.0
>            Reporter: Krystal
>            Assignee: Arina Ielchiieva
>
> Parquet directories with meta data cache return 0 rows for MINDIR and IMINDIR functions.
> hadoop fs -ls /tmp/querylogs_4
> Found 6 items
> -rwxr-xr-x   3 mapr mapr      15406 2016-06-13 10:18 /tmp/querylogs_4/.drill.parquet_metadata
> drwxr-xr-x   - root root          4 2016-06-13 10:18 /tmp/querylogs_4/1985
> drwxr-xr-x   - root root          3 2016-06-13 10:18 /tmp/querylogs_4/1999
> drwxr-xr-x   - root root          3 2016-06-13 10:18 /tmp/querylogs_4/2005
> drwxr-xr-x   - root root          4 2016-06-13 10:18 /tmp/querylogs_4/2014
> drwxr-xr-x   - root root          6 2016-06-13 10:18 /tmp/querylogs_4/2016
> hadoop fs -ls /tmp/querylogs_4/1985
> Found 4 items
> -rwxr-xr-x   3 mapr mapr       3634 2016-06-13 10:18 /tmp/querylogs_4/1985/.drill.parquet_metadata
> drwxr-xr-x   - root root          2 2016-06-13 10:18 /tmp/querylogs_4/1985/Feb
> drwxr-xr-x   - root root          2 2016-06-13 10:18 /tmp/querylogs_4/1985/apr
> drwxr-xr-x   - root root          2 2016-06-13 10:18 /tmp/querylogs_4/1985/jan
> SELECT * FROM `dfs.tmp`.`querylogs_4` WHERE dir0 = MINDIR('dfs.tmp','querylogs_4');
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> | voter_id  | name  | age  | registration  | contributions  | voterzone  | date_time  | dir0  | dir1  | dir2  |
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> No rows selected (0.803 seconds)
> If the meta cache is removed, expected data is returned.
> Here is the physical plan:
> {code}
> 00-00    Screen : rowType = RecordType(ANY *): rowcount = 3.75, cumulative cost = {54.125 rows, 169.125 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664191
> 00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 3.75, cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664190
> 00-02        Project(T51¦¦*=[$0]) : rowType = RecordType(ANY T51¦¦*): rowcount = 3.75, cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664189
> 00-03          SelectionVectorRemover : rowType = RecordType(ANY T51¦¦*, ANY dir0): rowcount = 3.75, cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664188
> 00-04            Filter(condition=[=($1, '.drill.parquet_metadata')]) : rowType = RecordType(ANY T51¦¦*, ANY dir0): rowcount = 3.75, cumulative cost = {50.0 rows, 165.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664187
> 00-05              Project(T51¦¦*=[$0], dir0=[$1]) : rowType = RecordType(ANY T51¦¦*, ANY dir0): rowcount = 25.0, cumulative cost = {25.0 rows, 50.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664186
> 00-06                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/querylogs_4/2005/May/voter25.parquet/0_0_0.parquet]], selectionRoot=/tmp/querylogs_4, numFiles=1, usedMetadataFile=true, columns=[`*`]]]) : rowType = (DrillRecordRow[*, dir0]): rowcount = 25.0, cumulative cost = {25.0 rows, 50.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664185
> {code}
> Here is the plan for the same query against the same directory structure without meta data cache:
> {code}
> 00-00    Screen : rowType = RecordType(ANY *): rowcount = 75.0, cumulative cost = {82.5 rows, 157.5 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664312
> 00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 75.0, cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664311
> 00-02        Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 75.0, cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664310
> 00-03          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:///tmp/querylogs_1/1985/Feb/voter10.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:///tmp/querylogs_1/1985/jan/voter5.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:///tmp/querylogs_1/1985/apr/voter65.parquet/0_0_0.parquet]], selectionRoot=maprfs:/tmp/querylogs_1, numFiles=3, usedMetadataFile=false, columns=[`*`]]]) : rowType = (DrillRecordRow[*, dir0]): rowcount = 75.0, cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664309
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)