[ https://issues.apache.org/jira/browse/DRILL-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069961#comment-16069961 ]

ASF GitHub Bot commented on DRILL-4720:
---------------------------------------

GitHub user arina-ielchiieva opened a pull request:

    https://github.com/apache/drill/pull/864

    DRILL-4720: Fix SchemaPartitionExplorer.getSubPartitions method implementations to return only Drill file system directories

    1. Added file system util helper classes to standardize list directory and file statuses usage in Drill with appropriate unit tests.
    2. Fixed SchemaPartitionExplorer.getSubPartitions method implementations to return only directories that can be partitions according to Drill file system rules (excluded all files and directories that start with dot or underscore).
    3. Added unit test for directory explorers UDFs with and without metadata cache presence.
    4. Minor refactoring.
    
    Details in Jira [DRILL-4720](https://issues.apache.org/jira/browse/DRILL-4720).
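
The rule in item 2 can be illustrated with a small sketch. This is not the code from the pull request: the class `PartitionDirFilter`, its method names, and the name-to-isDirectory map are hypothetical stand-ins for a real file system listing, and the `_staging` entry is made up to show the underscore case.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the helper classes added in the pull request.
public class PartitionDirFilter {

    // A name is visible when it does not start with a dot or an underscore.
    static boolean isVisibleName(String name) {
        return !name.isEmpty() && !name.startsWith(".") && !name.startsWith("_");
    }

    // entries maps an entry name to true when that entry is a directory.
    // Only visible directories are kept; files are always dropped.
    static List<String> subPartitions(Map<String, Boolean> entries) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : entries.entrySet()) {
            if (entry.getValue() && isVisibleName(entry.getKey())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Listing modeled on /tmp/querylogs_4 from the issue, plus one made-up "_staging" directory.
        Map<String, Boolean> listing = new LinkedHashMap<>();
        listing.put(".drill.parquet_metadata", false);
        listing.put("1985", true);
        listing.put("1999", true);
        listing.put("2005", true);
        listing.put("2014", true);
        listing.put("2016", true);
        listing.put("_staging", true);

        System.out.println(subPartitions(listing)); // prints [1985, 1999, 2005, 2014, 2016]
    }
}
```

With such a filter in place, the metadata cache file never appears among the candidate partition names.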

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/arina-ielchiieva/drill DRILL-4720

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/864.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #864
    
----
commit 6d592373c740fd793ed6bbb3264b97b52e4b763b
Author: Arina Ielchiieva <arina.yelchiy...@gmail.com>
Date:   2017-06-29T13:08:33Z

    DRILL-4720: Fix SchemaPartitionExplorer.getSubPartitions method implementations to return only Drill file system directories

    1. Added file system util helper classes to standardize list directory and file statuses usage in Drill with appropriate unit tests.
    2. Fixed SchemaPartitionExplorer.getSubPartitions method implementations to return only directories that can be partitions according to Drill file system rules (excluded all files and directories that start with dot or underscore).
    3. Added unit test for directory explorers UDFs with and without metadata cache presence.
    4. Minor refactoring.

----


> MINDIR() and IMINDIR() functions return no results with metadata cache
> ----------------------------------------------------------------------
>
>                 Key: DRILL-4720
>                 URL: https://issues.apache.org/jira/browse/DRILL-4720
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.7.0
>            Reporter: Krystal
>            Assignee: Arina Ielchiieva
>
> Parquet directories with a metadata cache return 0 rows for the MINDIR and IMINDIR functions.
> hadoop fs -ls /tmp/querylogs_4
> Found 6 items
> -rwxr-xr-x   3 mapr mapr      15406 2016-06-13 10:18 /tmp/querylogs_4/.drill.parquet_metadata
> drwxr-xr-x   - root root          4 2016-06-13 10:18 /tmp/querylogs_4/1985
> drwxr-xr-x   - root root          3 2016-06-13 10:18 /tmp/querylogs_4/1999
> drwxr-xr-x   - root root          3 2016-06-13 10:18 /tmp/querylogs_4/2005
> drwxr-xr-x   - root root          4 2016-06-13 10:18 /tmp/querylogs_4/2014
> drwxr-xr-x   - root root          6 2016-06-13 10:18 /tmp/querylogs_4/2016
> hadoop fs -ls /tmp/querylogs_4/1985
> Found 4 items
> -rwxr-xr-x   3 mapr mapr       3634 2016-06-13 10:18 /tmp/querylogs_4/1985/.drill.parquet_metadata
> drwxr-xr-x   - root root          2 2016-06-13 10:18 /tmp/querylogs_4/1985/Feb
> drwxr-xr-x   - root root          2 2016-06-13 10:18 /tmp/querylogs_4/1985/apr
> drwxr-xr-x   - root root          2 2016-06-13 10:18 /tmp/querylogs_4/1985/jan
> SELECT * FROM `dfs.tmp`.`querylogs_4` WHERE dir0 = MINDIR('dfs.tmp','querylogs_4');
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> | voter_id  | name  | age  | registration  | contributions  | voterzone  | date_time  | dir0  | dir1  | dir2  |
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> No rows selected (0.803 seconds)
> If the metadata cache is removed, the expected data is returned.
> Here is the physical plan:
> {code}
> 00-00    Screen : rowType = RecordType(ANY *): rowcount = 3.75, cumulative cost = {54.125 rows, 169.125 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664191
> 00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 3.75, cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664190
> 00-02        Project(T51¦¦*=[$0]) : rowType = RecordType(ANY T51¦¦*): rowcount = 3.75, cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664189
> 00-03          SelectionVectorRemover : rowType = RecordType(ANY T51¦¦*, ANY dir0): rowcount = 3.75, cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664188
> 00-04            Filter(condition=[=($1, '.drill.parquet_metadata')]) : rowType = RecordType(ANY T51¦¦*, ANY dir0): rowcount = 3.75, cumulative cost = {50.0 rows, 165.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664187
> 00-05              Project(T51¦¦*=[$0], dir0=[$1]) : rowType = RecordType(ANY T51¦¦*, ANY dir0): rowcount = 25.0, cumulative cost = {25.0 rows, 50.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664186
> 00-06                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/querylogs_4/2005/May/voter25.parquet/0_0_0.parquet]], selectionRoot=/tmp/querylogs_4, numFiles=1, usedMetadataFile=true, columns=[`*`]]]) : rowType = (DrillRecordRow[*, dir0]): rowcount = 25.0, cumulative cost = {25.0 rows, 50.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664185
> {code}
> Here is the plan for the same query against the same directory structure without a metadata cache:
> {code}
> 00-00    Screen : rowType = RecordType(ANY *): rowcount = 75.0, cumulative cost = {82.5 rows, 157.5 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664312
> 00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 75.0, cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664311
> 00-02        Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 75.0, cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664310
> 00-03          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:///tmp/querylogs_1/1985/Feb/voter10.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:///tmp/querylogs_1/1985/jan/voter5.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:///tmp/querylogs_1/1985/apr/voter65.parquet/0_0_0.parquet]], selectionRoot=maprfs:/tmp/querylogs_1, numFiles=3, usedMetadataFile=false, columns=[`*`]]]) : rowType = (DrillRecordRow[*, dir0]): rowcount = 75.0, cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664309
> {code}
> {code}
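
The filter condition `dir0 = '.drill.parquet_metadata'` in the first plan is consistent with the metadata cache file being picked as the minimum name. The sketch below only demonstrates the string ordering involved, assuming the minimum is taken lexicographically over every name in the listing; `MinDirDemo` and its methods are hypothetical and are not Drill code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of the name ordering; not Drill's MINDIR implementation.
public class MinDirDemo {

    // Lexicographically smallest name, using String's natural ordering.
    static String minName(List<String> names) {
        return Collections.min(names);
    }

    // Drops names that start with a dot or an underscore.
    static List<String> withoutHidden(List<String> names) {
        return names.stream()
                .filter(n -> !n.startsWith(".") && !n.startsWith("_"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Entry names under /tmp/querylogs_4 as listed in the issue.
        List<String> names = Arrays.asList(
                ".drill.parquet_metadata", "1985", "1999", "2005", "2014", "2016");

        // '.' (0x2E) sorts before the digit '1' (0x31), so the cache file wins.
        System.out.println(minName(names));                // prints .drill.parquet_metadata
        System.out.println(minName(withoutHidden(names))); // prints 1985
    }
}
```

No data file has a `dir0` value equal to the cache file name, which would explain the empty result when the cache is present.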



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
