[ 
https://issues.apache.org/jira/browse/DRILL-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069951#comment-16069951
 ] 

Arina Ielchiieva edited comment on DRILL-4720 at 6/30/17 11:40 AM:
-------------------------------------------------------------------

In Drill code we frequently list directory and file statuses:
{{ShowFileHandler}} - lists all directories and files present in given path 
(including hidden)
{{FileSelection}} - lists all files recursively in given path excluding files 
and directories that start with dot or underscore
{{FileSystemSchemaFactory}} - lists all directories present in given path 
excluding directories that start with dot or underscore
{{WorkspaceSchemaFactory}} - 1. lists all directories present in given path 
excluding directories that start with dot or underscore 2. lists all files 
recursively in given path excluding files and directories that start with dot 
or underscore
{{FooterGatherer}} - lists all files in given path excluding files that start 
with dot or underscore
{{Metadata}} - 1. lists all directories and files in given path excluding files 
that start with dot or underscore  2. lists all files recursively in given path 
excluding files and directories that start with dot or underscore
{{ParquetFormatPlugin}} - lists all files in given path excluding files that 
start with dot or underscore
{{ParquetGroupScan}} - lists all files recursively in given path excluding 
files that start with dot or underscore
{{LocalPersistentStore}} - lists all files in given path that end with 
.sys.drill excluding files that start with dot or underscore

In many cases each class implements its own recursive search, creates a new 
instance of the Drill filter (though it could be shared), etc.
Some use {{DrillFileSystem.list(boolean recursive, Path... paths)}}, whose 
logic, as I have mentioned before, is not obvious.

Common list status use cases:
1. List directories and / or files, recursively or not.
2. List directories and / or files, recursively or not, applying the Drill file 
system filter.
3. List directories and / or files, recursively or not, applying a custom filter.

To standardize list status usage in Drill, I suggest we add two new helper 
classes that will hold all list status logic.
The first one is {{FileSystemUtil}}, which will have the following methods:
{{public static List<FileStatus> listDirectories(final FileSystem fs, Path 
path, boolean recursive, PathFilter... filters) throws IOException {}}
{{public static List<FileStatus> listFiles(FileSystem fs, Path path, boolean 
recursive, PathFilter... filters) throws IOException {}}
{{public static List<FileStatus> listAll(FileSystem fs, Path path, boolean 
recursive, PathFilter... filters) throws IOException {}}
We might add other file system helper methods later, which is why the class 
name is quite abstract. Developers will be able to use these methods to list 
statuses of directories, files or both, recursively or not, and will also be 
able to apply custom filters if needed.
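
A minimal sketch of how the three methods could share one traversal routine. This is an illustration, not the actual Drill implementation: to keep it runnable without Hadoop on the classpath, {{java.nio.file.Path}} stands in for {{FileSystem}} / {{FileStatus}} and {{Predicate<Path>}} stands in for {{PathFilter}}. The class name {{FileSystemUtilSketch}}, the {{Scope}} enum and the choice not to descend into directories rejected by a filter are all assumptions.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for the proposed FileSystemUtil.
// Path replaces Hadoop's FileSystem/FileStatus, Predicate<Path> replaces PathFilter.
final class FileSystemUtilSketch {

  // What kind of entries the caller wants back.
  private enum Scope { DIRECTORIES, FILES, ALL }

  @SafeVarargs
  static List<Path> listDirectories(Path path, boolean recursive, Predicate<Path>... filters)
      throws IOException {
    return list(path, Scope.DIRECTORIES, recursive, merge(filters));
  }

  @SafeVarargs
  static List<Path> listFiles(Path path, boolean recursive, Predicate<Path>... filters)
      throws IOException {
    return list(path, Scope.FILES, recursive, merge(filters));
  }

  @SafeVarargs
  static List<Path> listAll(Path path, boolean recursive, Predicate<Path>... filters)
      throws IOException {
    return list(path, Scope.ALL, recursive, merge(filters));
  }

  // Combines all filters with logical AND; no filters means "accept everything".
  private static Predicate<Path> merge(Predicate<Path>[] filters) {
    Predicate<Path> merged = p -> true;
    for (Predicate<Path> filter : filters) {
      merged = merged.and(filter);
    }
    return merged;
  }

  // Single traversal shared by all three public methods.
  private static List<Path> list(Path path, Scope scope, boolean recursive, Predicate<Path> filter)
      throws IOException {
    List<Path> result = new ArrayList<>();
    try (DirectoryStream<Path> children = Files.newDirectoryStream(path)) {
      for (Path child : children) {
        if (!filter.test(child)) {
          continue; // assumption: a rejected directory is not descended into either
        }
        boolean directory = Files.isDirectory(child);
        boolean wanted = scope == Scope.ALL
            || (directory && scope == Scope.DIRECTORIES)
            || (!directory && scope == Scope.FILES);
        if (wanted) {
          result.add(child);
        }
        if (directory && recursive) {
          result.addAll(list(child, scope, true, filter));
        }
      }
    }
    return result;
  }
}
```

With this shape, a caller such as {{FooterGatherer}} would ask for {{listFiles(path, false)}}, while {{FileSelection}} would ask for {{listFiles(path, true)}}; neither would need its own recursion. Note that in this sketch a custom filter is also tested against directories, so a filter that only matches file names would stop recursion at the first level.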

The second one is {{DrillFileSystemUtil}}, which will have the same methods as 
{{FileSystemUtil}} but will also apply the Drill file system filter, so files 
and folders that start with a dot or underscore are excluded.
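
A sketch of the Drill-specific variant, under the same assumptions as before: {{java.nio.file}} types stand in for the Hadoop ones, and the names {{DrillFileSystemUtilSketch}} and {{DRILL_SYSTEM_FILTER}} are hypothetical. It shows a single shared filter instance being combined with any caller-supplied filters, so that entries such as {{.drill.parquet_metadata}} are never returned.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for the proposed DrillFileSystemUtil.
final class DrillFileSystemUtilSketch {

  // One shared, stateless filter instead of a new instance per caller:
  // rejects names that start with a dot or an underscore.
  static final Predicate<Path> DRILL_SYSTEM_FILTER = path -> {
    String name = path.getFileName().toString();
    return !name.startsWith(".") && !name.startsWith("_");
  };

  @SafeVarargs
  static List<Path> listFiles(Path path, boolean recursive, Predicate<Path>... filters)
      throws IOException {
    // The Drill filter always applies; custom filters only narrow the result further.
    Predicate<Path> combined = DRILL_SYSTEM_FILTER;
    for (Predicate<Path> filter : filters) {
      combined = combined.and(filter);
    }
    List<Path> files = new ArrayList<>();
    collect(path, recursive, combined, files);
    return files;
  }

  private static void collect(Path dir, boolean recursive, Predicate<Path> filter, List<Path> out)
      throws IOException {
    try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
      for (Path child : children) {
        if (!filter.test(child)) {
          continue; // hidden files and hidden directories are both skipped
        }
        if (Files.isDirectory(child)) {
          if (recursive) {
            collect(child, true, filter, out);
          }
        } else {
          out.add(child);
        }
      }
    }
  }
}
```

For a layout like the one in the issue description, a recursive {{listFiles}} call on the table root would return only the data files under the year and month directories, and none of the metadata cache files.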









> MINDIR() and IMINDIR() functions return no results with metadata cache
> ----------------------------------------------------------------------
>
>                 Key: DRILL-4720
>                 URL: https://issues.apache.org/jira/browse/DRILL-4720
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.7.0
>            Reporter: Krystal
>            Assignee: Arina Ielchiieva
>
> Parquet directories with meta data cache return 0 rows for MINDIR and IMINDIR 
> functions.
> hadoop fs -ls /tmp/querylogs_4
> Found 6 items
> -rwxr-xr-x   3 mapr mapr      15406 2016-06-13 10:18 
> /tmp/querylogs_4/.drill.parquet_metadata
> drwxr-xr-x   - root root          4 2016-06-13 10:18 /tmp/querylogs_4/1985
> drwxr-xr-x   - root root          3 2016-06-13 10:18 /tmp/querylogs_4/1999
> drwxr-xr-x   - root root          3 2016-06-13 10:18 /tmp/querylogs_4/2005
> drwxr-xr-x   - root root          4 2016-06-13 10:18 /tmp/querylogs_4/2014
> drwxr-xr-x   - root root          6 2016-06-13 10:18 /tmp/querylogs_4/2016
> hadoop fs -ls /tmp/querylogs_4/1985
> Found 4 items
> -rwxr-xr-x   3 mapr mapr       3634 2016-06-13 10:18 
> /tmp/querylogs_4/1985/.drill.parquet_metadata
> drwxr-xr-x   - root root          2 2016-06-13 10:18 /tmp/querylogs_4/1985/Feb
> drwxr-xr-x   - root root          2 2016-06-13 10:18 /tmp/querylogs_4/1985/apr
> drwxr-xr-x   - root root          2 2016-06-13 10:18 
> /tmp/querylogs_4/1985/jan 
> SELECT * FROM `dfs.tmp`.`querylogs_4` WHERE dir0 = 
> MINDIR('dfs.tmp','querylogs_4');
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> | voter_id  | name  | age  | registration  | contributions  | voterzone  | 
> date_time  | dir0  | dir1  | dir2  |
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> No rows selected (0.803 seconds)
> If the meta cache is removed, expected data is returned.
> Here is the physical plan:
> {code}
> 00-00    Screen : rowType = RecordType(ANY *): rowcount = 3.75, cumulative 
> cost = {54.125 rows, 169.125 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 
> 664191
> 00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 3.75, 
> cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 0.0 network, 0.0 memory}, 
> id = 664190
> 00-02        Project(T51¦¦*=[$0]) : rowType = RecordType(ANY T51¦¦*): 
> rowcount = 3.75, cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 0.0 
> network, 0.0 memory}, id = 664189
> 00-03          SelectionVectorRemover : rowType = RecordType(ANY T51¦¦*, ANY 
> dir0): rowcount = 3.75, cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 
> 0.0 network, 0.0 memory}, id = 664188
> 00-04            Filter(condition=[=($1, '.drill.parquet_metadata')]) : 
> rowType = RecordType(ANY T51¦¦*, ANY dir0): rowcount = 3.75, cumulative cost 
> = {50.0 rows, 165.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664187
> 00-05              Project(T51¦¦*=[$0], dir0=[$1]) : rowType = RecordType(ANY 
> T51¦¦*, ANY dir0): rowcount = 25.0, cumulative cost = {25.0 rows, 50.0 cpu, 
> 0.0 io, 0.0 network, 0.0 memory}, id = 664186
> 00-06                Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath 
> [path=/tmp/querylogs_4/2005/May/voter25.parquet/0_0_0.parquet]], 
> selectionRoot=/tmp/querylogs_4, numFiles=1, usedMetadataFile=true, 
> columns=[`*`]]]) : rowType = (DrillRecordRow[*, dir0]): rowcount = 25.0, 
> cumulative cost = {25.0 rows, 50.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id 
> = 664185
> {code}
> Here is the plan for the same query against the same directory structure 
> without meta data cache:
> {code}
> 00-00    Screen : rowType = RecordType(ANY *): rowcount = 75.0, cumulative 
> cost = {82.5 rows, 157.5 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664312
> 00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 75.0, 
> cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id 
> = 664311
> 00-02        Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 75.0, 
> cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id 
> = 664310
> 00-03          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=maprfs:///tmp/querylogs_1/1985/Feb/voter10.parquet/0_0_0.parquet], 
> ReadEntryWithPath 
> [path=maprfs:///tmp/querylogs_1/1985/jan/voter5.parquet/0_0_0.parquet], 
> ReadEntryWithPath 
> [path=maprfs:///tmp/querylogs_1/1985/apr/voter65.parquet/0_0_0.parquet]], 
> selectionRoot=maprfs:/tmp/querylogs_1, numFiles=3, usedMetadataFile=false, 
> columns=[`*`]]]) : rowType = (DrillRecordRow[*, dir0]): rowcount = 75.0, 
> cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id 
> = 664309
> {code}



