[ https://issues.apache.org/jira/browse/DRILL-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16096238#comment-16096238 ]
Arina Ielchiieva commented on DRILL-4735:
-----------------------------------------

During implementation some details have changed:

1. It turned out that we can access session options directly in {{ConvertCountToDirectScan}} using {{PrelUtil.getPlannerSettings(call.getPlanner())}}, so there is no longer a need to pass {{OptimizerRulesContext}} to {{ConvertCountToDirectScan}}. We will skip applying this rule if a directory column is present in the selection. For implicit columns, on the contrary, we'll set the count result to the total record count, since implicit columns are based on the files and there is no data without a file. {{ConvertCountToDirectScan}} was also refactored: the count collection logic was encapsulated in the helper class {{CountsCollector}}.

2. We still introduced the {{DynamicPojoRecordReader}} class, but it accepts two parameters: first, the schema, represented by {{LinkedHashMap<String, Class<?>>}}, and second, the records themselves, represented by {{List<List<T>>}}. We force the user to pass the schema to cover the case when there are no records to read but the schema is still needed to proceed. If all records are of the same type, the user may set {{T}} to that very type; if records contain different types, {{T}} should be set to {{Object}}.

3. The {{MetadataDirectGroupScan}} string representation now also includes the number of files:
{noformat}
[usedMetadata = true, files = [/tpch/nation.parquet], numFiles = 1]
{noformat}

> Count(dir0) on parquet returns 0 result
> ---------------------------------------
>
>                 Key: DRILL-4735
>                 URL: https://issues.apache.org/jira/browse/DRILL-4735
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization, Storage - Parquet
>    Affects Versions: 1.0.0, 1.4.0, 1.6.0, 1.7.0
>            Reporter: Krystal
>            Assignee: Arina Ielchiieva
>            Priority: Critical
>
> Selecting a count of dir0, dir1, etc. against a parquet directory returns 0 rows.
> select count(dir0) from `min_max_dir`;
> +---------+
> | EXPR$0  |
> +---------+
> | 0       |
> +---------+
> select count(dir1) from `min_max_dir`;
> +---------+
> | EXPR$0  |
> +---------+
> | 0       |
> +---------+
> If I put both dir0 and dir1 in the same select, it returns the expected result:
> select count(dir0), count(dir1) from `min_max_dir`;
> +---------+---------+
> | EXPR$0  | EXPR$1  |
> +---------+---------+
> | 600     | 600     |
> +---------+---------+
> Here is the physical plan for the count(dir0) query:
> {code}
> 00-00    Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 20.0, cumulative cost = {22.0 rows, 22.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1346
> 00-01      Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 20.0, cumulative cost = {20.0 rows, 20.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1345
> 00-02        Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 20.0, cumulative cost = {20.0 rows, 20.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1344
> 00-03          Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@3da85d3b[columns = null, isStarQuery = false, isSkipQuery = false]]) : rowType = RecordType(BIGINT count): rowcount = 20.0, cumulative cost = {20.0 rows, 20.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1343
> {code}
> Here is part of the explain plan for count(dir0) and count(dir1) in the same select:
> {code}
> 00-00    Screen : rowType = RecordType(BIGINT EXPR$0, BIGINT EXPR$1): rowcount = 60.0, cumulative cost = {1206.0 rows, 15606.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1623
> 00-01      Project(EXPR$0=[$0], EXPR$1=[$1]) : rowType = RecordType(BIGINT EXPR$0, BIGINT EXPR$1): rowcount = 60.0, cumulative cost = {1200.0 rows, 15600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1622
> 00-02        StreamAgg(group=[{}], EXPR$0=[COUNT($0)], EXPR$1=[COUNT($1)]) : rowType = RecordType(BIGINT EXPR$0, BIGINT EXPR$1): rowcount = 60.0, cumulative cost = {1200.0 rows, 15600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1621
> 00-03          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/1999/Apr/voter20.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/1999/MAR/voter15.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/1985/jan/voter5.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/1985/apr/voter60.parquet/0_0_0.parquet],..., ReadEntryWithPath [path=maprfs:/drill/testdata/min_max_dir/2014/jul/voter35.parquet/0_0_0.parquet]], selectionRoot=maprfs:/drill/testdata/min_max_dir, numFiles=16, usedMetadataFile=false, columns=[`dir0`, `dir1`]]]) : rowType = RecordType(ANY dir0, ANY dir1): rowcount = 600.0, cumulative cost = {600.0 rows, 1200.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1620
> {code}
> Notice that in the first case, "org.apache.drill.exec.store.pojo.PojoRecordReader" is used.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
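
As an aside on point 2 of the comment above: the following is a minimal, self-contained sketch in plain Java (not the actual Drill {{DynamicPojoRecordReader}} class, and the column names and values are hypothetical) of the two-parameter shape described there: a {{LinkedHashMap}} schema that preserves column order and is required even when there are no records, and {{List<List<T>>}} records with {{T}} set to {{Object}} because the column types differ.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;

public class DynamicPojoShapeSketch {
    public static void main(String[] args) {
        // Schema: LinkedHashMap keeps insertion order, so column order is
        // stable; it must be supplied even when the records list is empty.
        LinkedHashMap<String, Class<?>> schema = new LinkedHashMap<>();
        schema.put("count", Long.class);
        schema.put("usedMetadata", Boolean.class);

        // Records: the two columns have different types (Long vs Boolean),
        // so the element type parameter is Object. A count-to-direct-scan
        // rewrite would produce a single summary row like this.
        List<List<Object>> records =
                Arrays.asList(Arrays.asList(600L, Boolean.TRUE));

        System.out.println(schema.keySet()); // column names, in order
        System.out.println(records.get(0));  // the single summary row
    }
}
```

If all columns shared one type, {{T}} could instead be that type (e.g. {{List<List<Long>>}}), as the comment notes.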