[jira] [Created] (DRILL-6565) cume_dist does not return enough rows

Robert Hou (JIRA) Fri, 29 Jun 2018 13:51:29 -0700

Robert Hou created DRILL-6565:
---------------------------------

             Summary: cume_dist does not return enough rows
                 Key: DRILL-6565
                 URL: https://issues.apache.org/jira/browse/DRILL-6565
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Relational Operators
    Affects Versions: 1.14.0
            Reporter: Robert Hou
            Assignee: Pritesh Maker
         Attachments: drillbit.log.7802


This query should return 64 rows but returns only 38:

alter session set `planner.width.max_per_node` = 1;
alter session set `planner.width.max_per_query` = 1;
select * from (
select cume_dist() over (order by Index) IntervalSecondValuea, Index from
(select * from dfs.`/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet` order by BigIntvalue)) d where d.Index = 1;

I could not reproduce the problem with a smaller table, and I could not 
reproduce it without the outer select statement.
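
For reference, the semantics the query relies on can be sketched in a few lines of Python. This is only an illustration of standard CUME_DIST behaviour, not Drill's implementation; the `index_col` data is a hypothetical stand-in for the Index column, sized so that 64 rows have Index = 1. The point is that the window function emits one row per input row, so the outer filter alone decides how many rows come back.

```python
from bisect import bisect_right

def cume_dist(sorted_keys):
    """CUME_DIST for each row of an already-sorted, single-partition input.

    For a row with key k: (number of rows with key <= k) / (total rows).
    Tied rows (peers) all receive the same value.
    """
    n = len(sorted_keys)
    return [bisect_right(sorted_keys, k) / n for k in sorted_keys]

# Hypothetical stand-in for the Index column: 64 rows equal to 1, plus others.
index_col = sorted([1] * 64 + [2] * 100 + [3] * 36)

# Window step first, then the outer filter on Index = 1.
windowed = list(zip(cume_dist(index_col), index_col))
filtered = [(cd, idx) for cd, idx in windowed if idx == 1]

print(len(filtered))                 # one output row per input row with Index = 1
print({cd for cd, _ in filtered})    # the peer group shares a single value
```

Under these assumptions the filter keeps all 64 rows and every one of them carries the same cume_dist value, which is what the query above is expected to do on the real table.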

Here is the explain plan:
{noformat}
00-00    Screen : rowType = RecordType(DOUBLE IntervalSecondValuea, ANY Index): rowcount = 12000.0, cumulative cost = {757200.0 rows, 1.1573335922911648E7 cpu, 0.0 io, 0.0 network, 1920000.0 memory}, id = 4034
00-01      ProjectAllowDup(IntervalSecondValuea=[$0], Index=[$1]) : rowType = RecordType(DOUBLE IntervalSecondValuea, ANY Index): rowcount = 12000.0, cumulative cost = {756000.0 rows, 1.1572135922911648E7 cpu, 0.0 io, 0.0 network, 1920000.0 memory}, id = 4033
00-02        Project(w0$o0=[$1], $0=[$0]) : rowType = RecordType(DOUBLE w0$o0, ANY $0): rowcount = 12000.0, cumulative cost = {744000.0 rows, 1.1548135922911648E7 cpu, 0.0 io, 0.0 network, 1920000.0 memory}, id = 4032
00-03          SelectionVectorRemover : rowType = RecordType(ANY $0, DOUBLE w0$o0): rowcount = 12000.0, cumulative cost = {732000.0 rows, 1.1524135922911648E7 cpu, 0.0 io, 0.0 network, 1920000.0 memory}, id = 4031
00-04            Filter(condition=[=($0, 1)]) : rowType = RecordType(ANY $0, DOUBLE w0$o0): rowcount = 12000.0, cumulative cost = {720000.0 rows, 1.1512135922911648E7 cpu, 0.0 io, 0.0 network, 1920000.0 memory}, id = 4030
00-05              Window(window#0=[window(partition {} order by [0] range between UNBOUNDED PRECEDING and CURRENT ROW aggs [CUME_DIST()])]) : rowType = RecordType(ANY $0, DOUBLE w0$o0): rowcount = 80000.0, cumulative cost = {640000.0 rows, 1.1144135922911648E7 cpu, 0.0 io, 0.0 network, 1920000.0 memory}, id = 4029
00-06                SelectionVectorRemover : rowType = RecordType(ANY $0): rowcount = 80000.0, cumulative cost = {560000.0 rows, 1.0984135922911648E7 cpu, 0.0 io, 0.0 network, 1920000.0 memory}, id = 4028
00-07                  Sort(sort0=[$0], dir0=[ASC]) : rowType = RecordType(ANY $0): rowcount = 80000.0, cumulative cost = {480000.0 rows, 1.0904135922911648E7 cpu, 0.0 io, 0.0 network, 1920000.0 memory}, id = 4027
00-08                    Project($0=[ITEM($0, 'Index')]) : rowType = RecordType(ANY $0): rowcount = 80000.0, cumulative cost = {400000.0 rows, 5692067.961455824 cpu, 0.0 io, 0.0 network, 1280000.0 memory}, id = 4026
00-09                      SelectionVectorRemover : rowType = RecordType(DYNAMIC_STAR T2¦¦**, ANY BigIntvalue): rowcount = 80000.0, cumulative cost = {320000.0 rows, 5612067.961455824 cpu, 0.0 io, 0.0 network, 1280000.0 memory}, id = 4025
00-10                        Sort(sort0=[$1], dir0=[ASC]) : rowType = RecordType(DYNAMIC_STAR T2¦¦**, ANY BigIntvalue): rowcount = 80000.0, cumulative cost = {240000.0 rows, 5532067.961455824 cpu, 0.0 io, 0.0 network, 1280000.0 memory}, id = 4024
00-11                          Project(T2¦¦**=[$0], BigIntvalue=[$1]) : rowType = RecordType(DYNAMIC_STAR T2¦¦**, ANY BigIntvalue): rowcount = 80000.0, cumulative cost = {160000.0 rows, 320000.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 4023
00-12                            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:///drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet]], selectionRoot=maprfs:/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet, numFiles=1, numRowGroups=6, usedMetadataFile=false, columns=[`**`]]]) : rowType = RecordType(DYNAMIC_STAR **, ANY BigIntvalue): rowcount = 80000.0, cumulative cost = {80000.0 rows, 160000.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 4022
{noformat}
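
When reading the plan above, it may help to pull out just the operator order and the row estimates. The following is a rough sketch of how that could be done in Python; the `PLAN` text is abridged from the plan (costs omitted), and the regular expression and the `operators` helper are illustrative assumptions, not part of any Drill tooling. Note that `rowcount` is a planner estimate, not the number of rows actually returned.

```python
import re

# Abridged operator lines copied from the plan above (costs omitted).
PLAN = """\
00-03          SelectionVectorRemover : rowType = RecordType(ANY $0, DOUBLE w0$o0): rowcount = 12000.0, id = 4031
00-04            Filter(condition=[=($0, 1)]) : rowType = RecordType(ANY $0, DOUBLE w0$o0): rowcount = 12000.0, id = 4030
00-05              Window(window#0=[window(partition {} order by [0] range between UNBOUNDED PRECEDING and CURRENT ROW aggs [CUME_DIST()])]) : rowType = RecordType(ANY $0, DOUBLE w0$o0): rowcount = 80000.0, id = 4029
00-07                  Sort(sort0=[$0], dir0=[ASC]) : rowType = RecordType(ANY $0): rowcount = 80000.0, id = 4027
"""

# Operator id, operator name, then the first "rowcount = ..." on the same line.
LINE = re.compile(r"^(\d\d-\d\d)\s+(\w+).*?rowcount = ([\d.E]+)", re.M)

def operators(plan_text):
    """Return (operator id, operator name, estimated rowcount) per plan line."""
    return [(m[1], m[2], float(m[3])) for m in LINE.finditer(plan_text)]

ops = operators(PLAN)
for op_id, name, rows in ops:
    print(op_id, name, rows)

names = [name for _, name, _ in ops]
# Rows flow from higher ids to lower ids, so Filter consumes Window's output.
print(names.index("Filter") < names.index("Window"))
```

The listing shows the Filter on Index = 1 sitting directly above the Window operator, so the filter is applied to the window's output rather than pushed below it.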

I have attached the drillbit.log.

The commit id is:
| 1.14.0-SNAPSHOT  | aa127b70b1e46f7f4aa19881f25eda583627830a  | DRILL-6523: Fix NPE for describe of partial schema  | 22.06.2018 @ 11:28:23 PDT  | r...@mapr.com  | 23.06.2018 @ 02:05:10 PDT  |

fourvarchar_asc_nulls95.q



