[jira] [Commented] (DRILL-6949) Query fails with "UNSUPPORTED_OPERATION ERROR: Hash-Join can not partition the inner data any further" when Semi join is enabled

2019-11-05 Thread Boaz Ben-Zvi (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968050#comment-16968050
 ] 

Boaz Ben-Zvi commented on DRILL-6949:
-

Tried adding this code (in onMatch() in HashJoinPrule.java) to use stats and
make the planning decision:
{code:java}
/*
 * For semi-join: When there are too many key-duplicates on the right side
 * (i.e., more than 50%, based on stats), use the old style plan (with a
 * hash-aggr below the build side to eliminate the duplicates).
 * (Else the hash-join memory would be overloaded with those duplicate entries.)
 */
if ( isSemi &&
     settings.isStatisticsEnabled() && // only when stats is enabled
     call.getMetadataQuery().getRowCount(right) >
       2 * call.getMetadataQuery().getDistinctRowCount(right,
           ImmutableBitSet.builder().addAll(join.getRightKeys()).build(), null) ) {
  return;
}
{code}
However, the following test failed to restore the Hash-Aggr:

 
{code:sql}
analyze table lineitem compute statistics;
SET `planner.statistics.use` = true;
select count(*) from lineitem T1
where T1.l_discount in (select distinct(cast(T2.l_discount as double)) from lineitem T2);
{code}
 

Apparently, the *cast* "confused" the code that computes the number of distinct
rows (it returned a much higher number).
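(A possible hardening - just a sketch, untested - is to sanity-check the stats
before trusting them; here {{mq}} stands for {{call.getMetadataQuery()}} and
{{rightKeys}} for the ImmutableBitSet built from {{join.getRightKeys()}}:)
{code:java}
// Sketch only: same heuristic as above, but distrust a distinct-row estimate
// that comes back null or above the row count (a sign the stats are unusable,
// e.g. when a CAST on the key confuses the NDV estimate).
Double rowCount = mq.getRowCount(right);
Double ndv = mq.getDistinctRowCount(right, rightKeys, null);
boolean statsUsable = rowCount != null && ndv != null && ndv <= rowCount;
if (isSemi && settings.isStatisticsEnabled() && statsUsable
    && rowCount > 2 * ndv) {
  return; // too many duplicates: keep the old plan (hash-aggr below the build)
}
{code}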

BTW, trying this query without a cast fails to plan:

 
{code:java}
apache drill (dfs.tmp)> select count(*) from lineitem T1 where T1.l_discount in (select T2.l_discount from lineitem T2);
Error: SYSTEM ERROR: CannotPlanException: There are not enough rules to produce a node with desired properties: convention=PHYSICAL, DrillDistributionTraitDef=SINGLETON([]), sort=[].
Missing conversion is DrillAggregateRel[convention: LOGICAL -> PHYSICAL, DrillDistributionTraitDef: ANY([]) -> SINGLETON([])]
There is 1 empty subset: rel#1710:Subset#19.PHYSICAL.SINGLETON([]).[], the relevant part of the original plan is as follows
1702:DrillAggregateRel(group=[{}], EXPR$0=[COUNT()])
  1699:DrillProjectRel(subset=[rel#1700:Subset#18.LOGICAL.ANY([]).[]], $f0=[0])
    1697:DrillSemiJoinRel(subset=[rel#1698:Subset#17.LOGICAL.ANY([]).[]], condition=[=($0, $1)], joinType=[semi])
      1556:DrillScanRel(subset=[rel#1696:Subset#16.LOGICAL.ANY([]).[]], table=[[dfs, tmp, lineitem]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/lineitem]], selectionRoot=file:/tmp/lineitem, numFiles=1, numRowGroups=3, usedMetadataFile=false, columns=[`l_discount`]]])
      1556:DrillScanRel(subset=[rel#1696:Subset#16.LOGICAL.ANY([]).[]], table=[[dfs, tmp, lineitem]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/lineitem]], selectionRoot=file:/tmp/lineitem, numFiles=1, numRowGroups=3, usedMetadataFile=false, columns=[`l_discount`]]])
{code}
 

 

> Query fails with "UNSUPPORTED_OPERATION ERROR: Hash-Join can not partition 
> the inner data any further" when Semi join is enabled
> 
>
> Key: DRILL-6949
> URL: https://issues.apache.org/jira/browse/DRILL-6949
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Affects Versions: 1.15.0
>Reporter: Abhishek Ravi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: Future
>
> Attachments: 23cc1240-74ff-a0c0-8cd5-938fc136e4e2.sys.drill, 
> 23cc1369-0812-63ce-1861-872636571437.sys.drill
>
>
> The following query fails with *Error: UNSUPPORTED_OPERATION ERROR: 
> Hash-Join can not partition the inner data any further (probably due to too 
> many join-key duplicates)* on TPC-H SF100 data.
> {code:sql}
> set `exec.hashjoin.enable.runtime_filter` = true;
> set `exec.hashjoin.runtime_filter.max.waiting.time` = 1;
> set `planner.enable_broadcast_join` = false;
> select
>  count(*)
> from
>  lineitem l1
> where
>  l1.l_discount IN (
>  select
>  distinct(cast(l2.l_discount as double))
>  from
>  lineitem l2);
> reset `exec.hashjoin.enable.runtime_filter`;
> reset `exec.hashjoin.runtime_filter.max.waiting.time`;
> reset `planner.enable_broadcast_join`;
> {code}
> The subquery contains the *distinct* keyword, and hence there should not be 
> duplicate values. 
> I suspect that the failure is caused by semijoin because the query succeeds 
> when semijoin is disabled explicitly.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-6938) SQL get the wrong result after hashjoin and hashagg disabled

2019-11-04 Thread Boaz Ben-Zvi (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967220#comment-16967220
 ] 

Boaz Ben-Zvi commented on DRILL-6938:
-

[~dony.dong] - can this case be closed? It looks like it only affected 1.13.

 

> SQL get the wrong result after hashjoin and hashagg disabled
> 
>
> Key: DRILL-6938
> URL: https://issues.apache.org/jira/browse/DRILL-6938
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Dony Dong
>Assignee: Boaz Ben-Zvi
>Priority: Critical
>
> Hi Team
> After we disable hashjoin and hashagg to fix an out-of-memory issue, we got 
> the wrong result.
> With these two parameters enabled, we will get 8 rows. After we disable them, 
> it only returns 3 rows. It seems some MEM_ID values were excluded before the 
> group-by or some other step.
> select b.MEM_ID,count(distinct b.DEP_NO)
> from dfs.test.emp b
> where b.DEP_NO<>'-'
> and b.MEM_ID in ('68','412','852','117','657','816','135','751')
> and b.HIRE_DATE>'2014-06-01'
> group by b.MEM_ID
> order by 1;



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-6949) Query fails with "UNSUPPORTED_OPERATION ERROR: Hash-Join can not partition the inner data any further" when Semi join is enabled

2019-11-04 Thread Boaz Ben-Zvi (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967217#comment-16967217
 ] 

Boaz Ben-Zvi commented on DRILL-6949:
-

DRILL-6845 has a PR (1606) that tries to solve this problem at runtime by 
eliminating duplicates (though some overhead is added).

 

> Query fails with "UNSUPPORTED_OPERATION ERROR: Hash-Join can not partition 
> the inner data any further" when Semi join is enabled
> 
>
> Key: DRILL-6949
> URL: https://issues.apache.org/jira/browse/DRILL-6949
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Affects Versions: 1.15.0
>Reporter: Abhishek Ravi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: Future
>
> Attachments: 23cc1240-74ff-a0c0-8cd5-938fc136e4e2.sys.drill, 
> 23cc1369-0812-63ce-1861-872636571437.sys.drill
>
>
> The following query fails with *Error: UNSUPPORTED_OPERATION ERROR: 
> Hash-Join can not partition the inner data any further (probably due to too 
> many join-key duplicates)* on TPC-H SF100 data.
> {code:sql}
> set `exec.hashjoin.enable.runtime_filter` = true;
> set `exec.hashjoin.runtime_filter.max.waiting.time` = 1;
> set `planner.enable_broadcast_join` = false;
> select
>  count(*)
> from
>  lineitem l1
> where
>  l1.l_discount IN (
>  select
>  distinct(cast(l2.l_discount as double))
>  from
>  lineitem l2);
> reset `exec.hashjoin.enable.runtime_filter`;
> reset `exec.hashjoin.runtime_filter.max.waiting.time`;
> reset `planner.enable_broadcast_join`;
> {code}
> The subquery contains the *distinct* keyword, and hence there should not be 
> duplicate values. 
> I suspect that the failure is caused by semijoin because the query succeeds 
> when semijoin is disabled explicitly.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-6845) Eliminate duplicates for Semi Hash Join

2019-11-04 Thread Boaz Ben-Zvi (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-6845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967216#comment-16967216
 ] 

Boaz Ben-Zvi commented on DRILL-6845:
-

The same problem can also be addressed during plan time with the use of 
statistics; see also DRILL-6949 

> Eliminate duplicates for Semi Hash Join
> ---
>
> Key: DRILL-6845
> URL: https://issues.apache.org/jira/browse/DRILL-6845
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Minor
>
> Following DRILL-6735: The performance of the new Semi Hash Join may degrade 
> if the build side contains an excessive number of join-key duplicate rows; 
> this is mainly a result of the need to store all those rows first, before the 
> hash table is built.
>   Proposed solution: For Semi, the Hash Agg would create a Hash-Table 
> initially, and use it to eliminate key-duplicate rows as they arrive.
>   Proposed extra: That Hash-Table has an added cost (e.g. resizing). So 
> perform "runtime stats" – check the initial number of incoming rows (e.g. 
> 32K), and if the number of duplicates is less than some threshold (e.g. 20%) 
> – cancel that "early" hash table (see the sketch below).
>  
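A rough sketch of the proposed "runtime stats" check (names and types are
illustrative only, not Drill's actual API):
{code:java}
// Sample the first ~32K build-side key hashes; keep the "early" hash table
// only if the observed duplicate ratio clears the threshold (e.g. 20%).
static final int SAMPLE_SIZE = 32 * 1024;
static final double MIN_DUP_RATIO = 0.20;

boolean keepEarlyHashTable(java.util.Iterator<Long> keyHashes) {
  java.util.Set<Long> seen = new java.util.HashSet<>();
  int rows = 0, duplicates = 0;
  while (keyHashes.hasNext() && rows < SAMPLE_SIZE) {
    rows++;
    if (!seen.add(keyHashes.next())) {
      duplicates++;
    }
  }
  // Cancel the early table when deduplication would not pay for its cost
  return rows > 0 && ((double) duplicates / rows) >= MIN_DUP_RATIO;
}
{code}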



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-4667) Improve memory footprint of broadcast joins

2019-11-04 Thread Boaz Ben-Zvi (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967215#comment-16967215
 ] 

Boaz Ben-Zvi commented on DRILL-4667:
-

Need to implement some kind of *shared memory* for multiple instances of the 
operator (in multiple minor fragments) to use, as well as to coordinate its 
lifecycle (which one builds it, when it is ready, when it is no longer needed, 
who can deallocate it).
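A rough sketch of the kind of coordination meant here (illustrative only, not
Drill code): one minor fragment builds the shared table, the others wait for
it, and the last fragment to finish deallocates it.
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Supplier;

class SharedBuildSide<T> {
  private final AtomicBoolean buildClaimed = new AtomicBoolean(false);
  private final CountDownLatch ready = new CountDownLatch(1);
  private final AtomicInteger users;
  private volatile T table;

  SharedBuildSide(int numFragments) {
    users = new AtomicInteger(numFragments);
  }

  /** Returns the shared table; exactly one caller builds it. */
  T get(Supplier<T> builder) throws InterruptedException {
    if (buildClaimed.compareAndSet(false, true)) {
      table = builder.get();  // this fragment builds and publishes the table
      ready.countDown();
    } else {
      ready.await();          // the other fragments wait until it is ready
    }
    return table;
  }

  /** Each fragment calls this when done; the last one releases the memory. */
  void release(Consumer<T> deallocator) {
    if (users.decrementAndGet() == 0) {
      deallocator.accept(table);
    }
  }
}
{code}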

> Improve memory footprint of broadcast joins
> ---
>
> Key: DRILL-4667
> URL: https://issues.apache.org/jira/browse/DRILL-4667
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.6.0
>Reporter: Aman Sinha
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.18.0
>
>
> For broadcast joins, currently Drill optimizes the data transfer across the 
> network for broadcast table by sending a single copy to the receiving node 
> which then distributes it to all minor fragments running on that particular 
> node.  However, each minor fragment builds its own hash table (for a hash 
> join) using this broadcast table.  We can substantially improve the memory 
> footprint by having a shared copy of the hash table among multiple minor 
> fragments on a node.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-4667) Improve memory footprint of broadcast joins

2019-11-04 Thread Boaz Ben-Zvi (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-4667:

Fix Version/s: 1.18.0

> Improve memory footprint of broadcast joins
> ---
>
> Key: DRILL-4667
> URL: https://issues.apache.org/jira/browse/DRILL-4667
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.6.0
>Reporter: Aman Sinha
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.18.0
>
>
> For broadcast joins, currently Drill optimizes the data transfer across the 
> network for broadcast table by sending a single copy to the receiving node 
> which then distributes it to all minor fragments running on that particular 
> node.  However, each minor fragment builds its own hash table (for a hash 
> join) using this broadcast table.  We can substantially improve the memory 
> footprint by having a shared copy of the hash table among multiple minor 
> fragments on a node.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7141) Hash-Join (and Agg) should always spill to disk the least used partition

2019-11-04 Thread Boaz Ben-Zvi (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-7141:

Fix Version/s: 1.18.0

> Hash-Join (and Agg) should always spill to disk the least used partition
> 
>
> Key: DRILL-7141
> URL: https://issues.apache.org/jira/browse/DRILL-7141
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.15.0
>Reporter: Kunal Khatua
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.18.0
>
>
> When the probe-side data for a hash join is skewed, it is preferable to have 
> the corresponding partition on the build side to be in memory. 
> Currently, with the spill-to-disk feature, the partition selected for 
> spilling to disk is done at random. This means that a highly skewed 
> probe-side data would also spill for lack of a corresponding hash table 
> partition in memory. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7141) Hash-Join (and Agg) should always spill to disk the least used partition

2019-11-04 Thread Boaz Ben-Zvi (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967208#comment-16967208
 ] 

Boaz Ben-Zvi commented on DRILL-7141:
-

This enhancement requires good statistics (on the probe side), as spilling 
decisions (i.e., which partition to spill) happen during the build phase.
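For illustration, a minimal sketch of such a stats-driven victim choice
(invented names, not Drill code): spill the in-memory partition the probe side
is least likely to hit.
{code:java}
// Given per-partition estimates of probe-side rows (e.g. from statistics),
// pick the non-spilled partition with the fewest expected probe rows.
int chooseSpillVictim(long[] estimatedProbeRows, boolean[] alreadySpilled) {
  int victim = -1;
  long minProbeRows = Long.MAX_VALUE;
  for (int p = 0; p < estimatedProbeRows.length; p++) {
    if (!alreadySpilled[p] && estimatedProbeRows[p] < minProbeRows) {
      minProbeRows = estimatedProbeRows[p];
      victim = p;
    }
  }
  return victim; // -1 if every partition has already been spilled
}
{code}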

 

> Hash-Join (and Agg) should always spill to disk the least used partition
> 
>
> Key: DRILL-7141
> URL: https://issues.apache.org/jira/browse/DRILL-7141
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.15.0
>Reporter: Kunal Khatua
>Assignee: Boaz Ben-Zvi
>Priority: Major
>
> When the probe-side data for a hash join is skewed, it is preferable to have 
> the corresponding partition on the build side to be in memory. 
> Currently, with the spill-to-disk feature, the partition selected for 
> spilling to disk is done at random. This means that a highly skewed 
> probe-side data would also spill for lack of a corresponding hash table 
> partition in memory. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-6836) Eliminate StreamingAggr for COUNT DISTINCT

2019-11-04 Thread Boaz Ben-Zvi (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6836:

Fix Version/s: Future

> Eliminate StreamingAggr for COUNT DISTINCT
> --
>
> Key: DRILL-6836
> URL: https://issues.apache.org/jira/browse/DRILL-6836
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators, Query Planning & Optimization
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Minor
> Fix For: Future
>
>
> The COUNT DISTINCT operation is often implemented with a Hash-Aggr operator 
> for the DISTINCT, and a Streaming-Aggr above to perform the COUNT.  That 
> Streaming-Aggr does the counting like any aggregation, counting each value, 
> batch after batch.
>   While very efficient, that counting work is basically not needed, as the 
> Hash-Aggr knows the number of distinct values (in the in-memory partitions).
>   Hence _a possible small performance improvement_ - eliminate the 
> Streaming-Aggr operator, and notify the Hash-Aggr to return a COUNT (these 
> are Planner changes). The Hash-Aggr operator would need to generate the 
> single Float8 column output schema, and output that batch with a single 
> value, just like the Streaming-Aggr did (likely without generating code).
>   In case of a spill, the Hash-Aggr still needs to read and process those 
> partitions, to get the exact distinct number.
>    The expected improvement is the elimination of the batch by batch output 
> from the Hash-Aggr, and the batch by batch, row by row processing of the 
> Streaming-Aggr.
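To make the idea concrete, a rough sketch (invented method, not Drill's API) of
where the count would come from:
{code:java}
// The Hash-Aggr already knows the number of distinct keys per partition:
// the hash-table size for in-memory partitions, plus the re-aggregated entry
// counts for partitions that had spilled and were read back.
long countDistinct(long[] inMemoryPartitionEntries, long[] respilledPartitionEntries) {
  long count = 0;
  for (long n : inMemoryPartitionEntries) {
    count += n;  // hash-table size == number of distinct keys in the partition
  }
  for (long n : respilledPartitionEntries) {
    count += n;  // spilled partitions were re-read and re-aggregated first
  }
  return count;
}
{code}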



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-6799) Enhance the Hash-Join Operator to perform Anti-Semi-Join

2019-11-04 Thread Boaz Ben-Zvi (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6799:

Fix Version/s: Future

> Enhance the Hash-Join Operator to perform Anti-Semi-Join
> 
>
> Key: DRILL-6799
> URL: https://issues.apache.org/jira/browse/DRILL-6799
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators, Query Planning & Optimization
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Minor
> Fix For: Future
>
>
> Similar to handling Semi-Join (see DRILL-6735), the Anti-Semi-Join can be 
> enhanced by eliminating the extra DISTINCT (i.e. Hash-Aggr) operator.
> Example (note the NOT IN):
> select c.c_first_name, c.c_last_name from dfs.`/data/json/s1/customer` c 
> where c.c_customer_sk NOT IN (select s.ss_customer_sk from 
> dfs.`/data/json/s1/store_sales` s) limit 4;



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7136) Num_buckets for HashAgg in profile may be inaccurate

2019-11-04 Thread Boaz Ben-Zvi (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967202#comment-16967202
 ] 

Boaz Ben-Zvi commented on DRILL-7136:
-

When a Hash-Aggr partition is spilled, its hash-table is reset (i.e. 
reallocated at the default size of 64K), but the prior number of resizings is 
left as-is, and so is the accumulated resizing time; hence these stats show the 
total across possibly multiple iterations of reset-build-spill. So when the 
stats are reported, they show the *current* hash-table size, and the 
*accumulated* resizing stats. (Hash-Join does not have this issue, as its 
hash-table is built only when the partition is wholly in memory.)
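An illustrative sketch (invented names) of the behavior described above:
{code:java}
// On spill the hash table is reset to the default 64K buckets, but the
// resizing counters keep accumulating across reset-build-spill cycles,
// which is why the profile pairs a *current* size with *accumulated* stats.
class HashTableStats {
  int numBuckets = 64 * 1024;
  int numResizing;      // accumulated across resets
  long resizingTimeMs;  // accumulated across resets

  void onResize(int newBuckets, long elapsedMs) {
    numBuckets = newBuckets;
    numResizing++;
    resizingTimeMs += elapsedMs;
  }

  void onSpillReset() {
    numBuckets = 64 * 1024;  // current size is reported after the reset,
  }                          // while the resizing totals are left as-is
}
{code}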

[~rhou] - should these stats be reported differently ?

 

> Num_buckets for HashAgg in profile may be inaccurate
> 
>
> Key: DRILL-7136
> URL: https://issues.apache.org/jira/browse/DRILL-7136
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Tools, Build & Test
>Affects Versions: 1.16.0
>Reporter: Robert Hou
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Attachments: 23650ee5-6721-8a8f-7dd3-f5dd09a3a7b0.sys.drill
>
>
> I ran TPCH query 17 with sf 1000.  Here is the query:
> {noformat}
> select
>   sum(l.l_extendedprice) / 7.0 as avg_yearly
> from
>   lineitem l,
>   part p
> where
>   p.p_partkey = l.l_partkey
>   and p.p_brand = 'Brand#13'
>   and p.p_container = 'JUMBO CAN'
>   and l.l_quantity < (
> select
>   0.2 * avg(l2.l_quantity)
> from
>   lineitem l2
> where
>   l2.l_partkey = p.p_partkey
>   );
> {noformat}
> One of the hash agg operators has resized 6 times.  It should have 4M 
> buckets.  But the profile shows it has 64K buckets.
> I have attached a sample profile.  In this profile, the hash agg operator is 
> (04-02).
> {noformat}
> Operator Metrics (Minor Fragment 04-00-02):
> NUM_BUCKETS             65,536
> NUM_ENTRIES             748,746
> NUM_RESIZING            6
> RESIZING_TIME_MS        364
> NUM_PARTITIONS          1
> SPILLED_PARTITIONS
> SPILL_MB                582
> SPILL_CYCLE             0
> INPUT_BATCH_COUNT       813
> AVG_INPUT_BATCH_BYTES   582,653
> AVG_INPUT_ROW_BYTES     18
> INPUT_RECORD_COUNT      26,316,456
> OUTPUT_BATCH_COUNT      401
> AVG_OUTPUT_BATCH_BYTES  1,631,943
> AVG_OUTPUT_ROW_BYTES    25
> OUTPUT_RECORD_COUNT     26,176,350
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-6758) Hash Join should not return the join columns when they are not needed downstream

2019-11-04 Thread Boaz Ben-Zvi (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6758:

Fix Version/s: 1.18.0

> Hash Join should not return the join columns when they are not needed 
> downstream
> 
>
> Key: DRILL-6758
> URL: https://issues.apache.org/jira/browse/DRILL-6758
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators, Query Planning & Optimization
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Hanumath Rao Maduri
>Priority: Minor
> Fix For: 1.18.0
>
>
> Currently the Hash-Join operator returns all its (both sides) incoming 
> columns. In cases where the join columns are not used further downstream, 
> this is a waste (allocating vectors, copying each value, etc).
>   Suggestion: Have the planner pass this information to the Hash-Join 
> operator, to enable skipping the return of these columns.
>  
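As a rough illustration of the suggestion (invented names, not Drill code), the
output projection would simply skip the excluded columns:
{code:java}
// The planner passes the set of columns nobody reads downstream; only the
// remaining columns get vectors allocated and values copied.
java.util.List<String> buildOutputColumns(java.util.List<String> incomingColumns,
                                          java.util.Set<String> notNeededDownstream) {
  java.util.List<String> out = new java.util.ArrayList<>();
  for (String col : incomingColumns) {
    if (!notNeededDownstream.contains(col)) {
      out.add(col);
    }
  }
  return out;
}
{code}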



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-6767) Simplify transfer of information from the planner to the operators

2019-11-04 Thread Boaz Ben-Zvi (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-6767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967173#comment-16967173
 ] 

Boaz Ben-Zvi commented on DRILL-6767:
-

Postponing this work, as it does not give an immediate benefit (maybe it can be 
done along with other work mentioned above, like the broadcast hash join, or 
skipping unneeded join columns).

 

> Simplify transfer of information from the planner to the operators
> --
>
> Key: DRILL-6767
> URL: https://issues.apache.org/jira/browse/DRILL-6767
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators, Query Planning & Optimization
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Minor
> Fix For: Future
>
>
> Currently little specific information known to the planner is passed to the 
> operators. For example, see the `joinType` parameter passed to the Join 
> operators (specifying whether this is a LEFT, RIGHT, INNER or FULL join). 
>  The relevant code passes this information explicitly via the constructors' 
> signature (e.g., see HashJoinPOP, AbstractJoinPop, etc), and uses specific 
> fields for this information, and affects all the test code using it, etc.
>  In the near future many more such "pieces of information" will possibly be 
> added to Drill, including:
>  (1) Is this a Semi (or Anti-Semi) join.
>  (2) `joinControl`
>  (3) `isRowKeyJoin`
>  (4) `isBroadcastJoin`
>  (5) Which join columns are not needed (DRILL-6758)
>  (6) Is this operator positioned between Lateral and UnNest.
>  (7) For Hash-Agg: Which phase (already implemented).
>  (8) For Hash-Agg: Perform COUNT  (DRILL-6836) 
> Each addition of such information would require a significant code change, 
> and add some code clutter.
> *Suggestion*: Instead pass a single object containing all the needed planner 
> information. So the next time another field is added, only that object needs 
> to be changed. (Ideally the whole plan could be passed, and then each 
> operator could poke and pick its needed fields)
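As a rough illustration (all field names invented here), the planner would fill
a single carrier object, and each operator would pick the fields it needs, so
adding a field no longer touches every constructor signature:
{code:java}
// Hypothetical carrier object for planner-to-operator information.
class PlannerJoinInfo {
  org.apache.calcite.rel.core.JoinRelType joinType;
  boolean isSemiJoin;          // (1) above
  int joinControl;             // (2)
  boolean isRowKeyJoin;        // (3)
  boolean isBroadcastJoin;     // (4)
  java.util.Set<String> joinColumnsNotNeededDownstream;  // (5), see DRILL-6758
}
{code}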



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-6767) Simplify transfer of information from the planner to the operators

2019-11-04 Thread Boaz Ben-Zvi (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6767:

Fix Version/s: Future

> Simplify transfer of information from the planner to the operators
> --
>
> Key: DRILL-6767
> URL: https://issues.apache.org/jira/browse/DRILL-6767
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators, Query Planning & Optimization
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Minor
> Fix For: Future
>
>
> Currently little specific information known to the planner is passed to the 
> operators. For example, see the `joinType` parameter passed to the Join 
> operators (specifying whether this is a LEFT, RIGHT, INNER or FULL join). 
>  The relevant code passes this information explicitly via the constructors' 
> signature (e.g., see HashJoinPOP, AbstractJoinPop, etc), and uses specific 
> fields for this information, and affects all the test code using it, etc.
>  In the near future many more such "pieces of information" will possibly be 
> added to Drill, including:
>  (1) Is this a Semi (or Anti-Semi) join.
>  (2) `joinControl`
>  (3) `isRowKeyJoin`
>  (4) `isBroadcastJoin`
>  (5) Which join columns are not needed (DRILL-6758)
>  (6) Is this operator positioned between Lateral and UnNest.
>  (7) For Hash-Agg: Which phase (already implemented).
>  (8) For Hash-Agg: Perform COUNT  (DRILL-6836) 
> Each addition of such information would require a significant code change, 
> and add some code clutter.
> *Suggestion*: Instead pass a single object containing all the needed planner 
> information. So the next time another field is added, only that object needs 
> to be changed. (Ideally the whole plan could be passed, and then each 
> operator could poke and pick its needed fields)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-6949) Query fails with "UNSUPPORTED_OPERATION ERROR: Hash-Join can not partition the inner data any further" when Semi join is enabled

2019-11-04 Thread Boaz Ben-Zvi (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967171#comment-16967171
 ] 

Boaz Ben-Zvi commented on DRILL-6949:
-

[~gparai] - note the potential use of statistics; e.g., if the number of 
distinct keys is less than half the total number of keys, disable semi-join.

 

> Query fails with "UNSUPPORTED_OPERATION ERROR: Hash-Join can not partition 
> the inner data any further" when Semi join is enabled
> 
>
> Key: DRILL-6949
> URL: https://issues.apache.org/jira/browse/DRILL-6949
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Affects Versions: 1.15.0
>Reporter: Abhishek Ravi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: Future
>
> Attachments: 23cc1240-74ff-a0c0-8cd5-938fc136e4e2.sys.drill, 
> 23cc1369-0812-63ce-1861-872636571437.sys.drill
>
>
> The following query fails with *Error: UNSUPPORTED_OPERATION ERROR: 
> Hash-Join can not partition the inner data any further (probably due to too 
> many join-key duplicates)* on TPC-H SF100 data.
> {code:sql}
> set `exec.hashjoin.enable.runtime_filter` = true;
> set `exec.hashjoin.runtime_filter.max.waiting.time` = 1;
> set `planner.enable_broadcast_join` = false;
> select
>  count(*)
> from
>  lineitem l1
> where
>  l1.l_discount IN (
>  select
>  distinct(cast(l2.l_discount as double))
>  from
>  lineitem l2);
> reset `exec.hashjoin.enable.runtime_filter`;
> reset `exec.hashjoin.runtime_filter.max.waiting.time`;
> reset `planner.enable_broadcast_join`;
> {code}
> The subquery contains the *distinct* keyword, and hence there should not be 
> duplicate values. 
> I suspect that the failure is caused by semijoin because the query succeeds 
> when semijoin is disabled explicitly.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-6949) Query fails with "UNSUPPORTED_OPERATION ERROR: Hash-Join can not partition the inner data any further" when Semi join is enabled

2019-11-04 Thread Boaz Ben-Zvi (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967156#comment-16967156
 ] 

Boaz Ben-Zvi commented on DRILL-6949:
-

The two profiles clearly show that this is a fundamental issue: there are only 
561 distinct values of "l_discount" out of 600M rows. The new "efficient 
semi-join" was designed based on the assumption that the build side has a 
relatively small number of duplicates. The new implementation saves the cost 
of the hash-aggregation, but then keeps *ALL* the build side rows. In a case 
like this query, the build side of the hash-join balloons to a huge size, which 
causes heavy spilling, which leads to secondary/tertiary/... spills that try to 
subdivide the partitions by key; but those keys are duplicates, hence the 
subdivision does not help (as the error message explains).

Such an extreme situation is not expected in real-world queries. The only 
workaround for such an unusual query (as [~agirish] suggested) is to disable 
the new semi-join, so that the Hash-Aggregate would eliminate the duplicates; 
see the example below.

Another far-future solution is to use statistics in the planner to detect such 
a case, and then the planner would disable the new semi-join automatically.
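A minimal illustration of that workaround (assuming the semi-join planner
option introduced with DRILL-6735 is named {{planner.enable_semijoin}}; verify
the option name on your build):
{code:sql}
-- Disable the new semi-join so the plan keeps the DISTINCT (Hash-Aggr)
-- below the build side, eliminating the key duplicates before the join.
alter session set `planner.enable_semijoin` = false;

select count(*) from lineitem l1
where l1.l_discount in (select distinct(cast(l2.l_discount as double)) from lineitem l2);

reset `planner.enable_semijoin`;
{code}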

 

> Query fails with "UNSUPPORTED_OPERATION ERROR: Hash-Join can not partition 
> the inner data any further" when Semi join is enabled
> 
>
> Key: DRILL-6949
> URL: https://issues.apache.org/jira/browse/DRILL-6949
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Affects Versions: 1.15.0
>Reporter: Abhishek Ravi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Attachments: 23cc1240-74ff-a0c0-8cd5-938fc136e4e2.sys.drill, 
> 23cc1369-0812-63ce-1861-872636571437.sys.drill
>
>
> The following query fails with *Error: UNSUPPORTED_OPERATION ERROR: 
> Hash-Join can not partition the inner data any further (probably due to too 
> many join-key duplicates)* on TPC-H SF100 data.
> {code:sql}
> set `exec.hashjoin.enable.runtime_filter` = true;
> set `exec.hashjoin.runtime_filter.max.waiting.time` = 1;
> set `planner.enable_broadcast_join` = false;
> select
>  count(*)
> from
>  lineitem l1
> where
>  l1.l_discount IN (
>  select
>  distinct(cast(l2.l_discount as double))
>  from
>  lineitem l2);
> reset `exec.hashjoin.enable.runtime_filter`;
> reset `exec.hashjoin.runtime_filter.max.waiting.time`;
> reset `planner.enable_broadcast_join`;
> {code}
> The subquery contains the *distinct* keyword, and hence there should not be 
> duplicate values. 
> I suspect that the failure is caused by semijoin because the query succeeds 
> when semijoin is disabled explicitly.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-6949) Query fails with "UNSUPPORTED_OPERATION ERROR: Hash-Join can not partition the inner data any further" when Semi join is enabled

2019-11-04 Thread Boaz Ben-Zvi (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6949:

Fix Version/s: Future

> Query fails with "UNSUPPORTED_OPERATION ERROR: Hash-Join can not partition 
> the inner data any further" when Semi join is enabled
> 
>
> Key: DRILL-6949
> URL: https://issues.apache.org/jira/browse/DRILL-6949
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Affects Versions: 1.15.0
>Reporter: Abhishek Ravi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: Future
>
> Attachments: 23cc1240-74ff-a0c0-8cd5-938fc136e4e2.sys.drill, 
> 23cc1369-0812-63ce-1861-872636571437.sys.drill
>
>
> The following query fails with *Error: UNSUPPORTED_OPERATION ERROR: 
> Hash-Join can not partition the inner data any further (probably due to too 
> many join-key duplicates)* on TPC-H SF100 data.
> {code:sql}
> set `exec.hashjoin.enable.runtime_filter` = true;
> set `exec.hashjoin.runtime_filter.max.waiting.time` = 1;
> set `planner.enable_broadcast_join` = false;
> select
>  count(*)
> from
>  lineitem l1
> where
>  l1.l_discount IN (
>  select
>  distinct(cast(l2.l_discount as double))
>  from
>  lineitem l2);
> reset `exec.hashjoin.enable.runtime_filter`;
> reset `exec.hashjoin.runtime_filter.max.waiting.time`;
> reset `planner.enable_broadcast_join`;
> {code}
> The subquery contains the *distinct* keyword, and hence there should not be 
> duplicate values. 
> I suspect that the failure is caused by semijoin because the query succeeds 
> when semijoin is disabled explicitly.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7244) Run-time rowgroup pruning match() fails on casting a Long to an Integer

2019-11-04 Thread Boaz Ben-Zvi (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967118#comment-16967118
 ] 

Boaz Ben-Zvi commented on DRILL-7244:
-

I just retried the original example (see DRILL-7240) with recent code (circa 
Oct 20) and a debugger: a breakpoint was indeed hit inside the "catch 
(ClassCastException cce)" clause (line 204 of 
AbstractParquetScanBatchCreator.java), and the log does note this event 
correctly:
{noformat}
2019-11-04 16:23:15,378 [223f3f7f-bcd5-8632-94e9-842495cdfd7d:frag:0:0] INFO o.a.d.e.s.p.AbstractParquetScanBatchCreator - Finished parquet_runtime_pruning in 45444993 usec. Out of given 2 rowgroups, 0 were pruned.
2019-11-04 16:23:15,379 [223f3f7f-bcd5-8632-94e9-842495cdfd7d:frag:0:0] INFO o.a.d.e.s.p.AbstractParquetScanBatchCreator - Run-time pruning skipped for 1 out of 2 rowgroups due to: java.lang.Integer cannot be cast to java.lang.Long{noformat}
There are some newer code changes there (from DRILL-4517 and DRILL-7314) but 
they did not seem to matter for this issue.
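The shape of that workaround, as a simplified sketch (identifiers are
illustrative; the actual catch is around line 204 of
AbstractParquetScanBatchCreator.java):
{code:java}
// If the rowgroup filter match() blows up on a type mismatch, skip run-time
// pruning for that rowgroup (and log it) instead of failing the query.
boolean tryMatch(java.util.function.Supplier<Boolean> matchCall, org.slf4j.Logger logger) {
  try {
    return matchCall.get();  // true == the rowgroup can be pruned
  } catch (ClassCastException cce) {
    logger.info("Run-time pruning skipped due to: {}", cce.getMessage());
    return false;            // keep (do not prune) the rowgroup
  }
}
{code}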

 

  

> Run-time rowgroup pruning match() fails on casting a Long to an Integer
> ---
>
> Key: DRILL-7244
> URL: https://issues.apache.org/jira/browse/DRILL-7244
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Storage - Parquet
>Affects Versions: 1.17.0
>Reporter: Boaz Ben-Zvi
>Assignee: Vova Vysotskyi
>Priority: Major
> Fix For: 1.17.0
>
>
> See DRILL-7240 , where a temporary workaround was created, skipping pruning 
> (and logging) instead of this failure: 
> After a Parquet table is refreshed with selected "interesting" columns, a 
> query whose WHERE clause contains a condition on a "non interesting" INT64 
> column fails during run-time pruning (calling match()) with:
> {noformat}
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
> {noformat}
> A long term solution is to pass the whole (or the relevant part of the) 
> schema to the runtime, instead of just passing the "interesting" columns.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7405) Build fails due to inaccessible apache-drill on S3 storage

2019-10-16 Thread Boaz Ben-Zvi (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953226#comment-16953226
 ] 

Boaz Ben-Zvi commented on DRILL-7405:
-

Seems to be working now, but that file should be placed on other storage, not 
AWS.

 

> Build fails due to inaccessible apache-drill on S3 storage
> --
>
> Key: DRILL-7405
> URL: https://issues.apache.org/jira/browse/DRILL-7405
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Tools, Build & Test
>Affects Versions: 1.16.0
>Reporter: Boaz Ben-Zvi
>Assignee: Abhishek Girish
>Priority: Minor
> Fix For: 1.17.0
>
>
>   A new clean build (e.g. after deleting the ~/.m2 local repository) would 
> fail now due to:  
> Access denied to: 
> http://apache-drill.s3.amazonaws.com/files/sf-0.01_tpc-h_parquet_typed.tgz
>  
> (e.g., for the test data  sf-0.01_tpc-h_parquet_typed.tgz )
> A new publicly available storage place is needed, plus appropriate changes in 
> Drill to get to these resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7405) Build fails due to inaccessible apache-drill on S3 storage

2019-10-15 Thread Boaz Ben-Zvi (Jira)
Boaz Ben-Zvi created DRILL-7405:
---

 Summary: Build fails due to inaccessible apache-drill on S3 storage
 Key: DRILL-7405
 URL: https://issues.apache.org/jira/browse/DRILL-7405
 Project: Apache Drill
  Issue Type: Bug
  Components: Tools, Build & Test
Affects Versions: 1.16.0
Reporter: Boaz Ben-Zvi
Assignee: Abhishek Girish


  A new clean build (e.g. after deleting the ~/.m2 local repository) would fail 
now due to:  

Access denied to: 
http://apache-drill.s3.amazonaws.com/files/sf-0.01_tpc-h_parquet_typed.tgz
 

(e.g., for the test data  sf-0.01_tpc-h_parquet_typed.tgz )

A new publicly available storage place is needed, plus appropriate changes in 
Drill to get to these resources.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-7170) IllegalStateException: Record count not set for this vector container

2019-10-04 Thread Boaz Ben-Zvi (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi resolved DRILL-7170.
-
  Reviewer: Sorabh Hamirwasia
Resolution: Fixed

> IllegalStateException: Record count not set for this vector container
> -
>
> Key: DRILL-7170
> URL: https://issues.apache.org/jira/browse/DRILL-7170
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Reporter: Sorabh Hamirwasia
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.17.0
>
>
> {code:java}
> Query: 
> /root/drillAutomation/master/framework/resources/Advanced/tpcds/tpcds_sf1/original/maprdb/json/query95.sql
> WITH ws_wh AS
> (
> SELECT ws1.ws_order_number,
> ws1.ws_warehouse_sk wh1,
> ws2.ws_warehouse_sk wh2
> FROM   web_sales ws1,
> web_sales ws2
> WHERE  ws1.ws_order_number = ws2.ws_order_number
> AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
> SELECT
> Count(DISTINCT ws_order_number) AS `order count` ,
> Sum(ws_ext_ship_cost)   AS `total shipping cost` ,
> Sum(ws_net_profit)  AS `total net profit`
> FROM web_sales ws1 ,
> date_dim ,
> customer_address ,
> web_site
> WHERE d_date BETWEEN '2000-04-01' AND (
> Cast('2000-04-01' AS DATE) + INTERVAL '60' day)
> AND  ws1.ws_ship_date_sk = d_date_sk
> AND  ws1.ws_ship_addr_sk = ca_address_sk
> AND  ca_state = 'IN'
> AND  ws1.ws_web_site_sk = web_site_sk
> AND  web_company_name = 'pri'
> AND  ws1.ws_order_number IN
> (
> SELECT ws_order_number
> FROM   ws_wh)
> AND  ws1.ws_order_number IN
> (
> SELECT wr_order_number
> FROM   web_returns,
> ws_wh
> WHERE  wr_order_number = ws_wh.ws_order_number)
> ORDER BY count(DISTINCT ws_order_number)
> LIMIT 100
> Exception:
> java.sql.SQLException: SYSTEM ERROR: IllegalStateException: Record count not 
> set for this vector container
> Fragment 2:3
> Please, refer to logs for more information.
> [Error Id: 4ed92fce-505b-40ba-ac0e-4a302c28df47 on drill87:31010]
>   (java.lang.IllegalStateException) Record count not set for this vector 
> container
> 
> org.apache.drill.shaded.guava.com.google.common.base.Preconditions.checkState():459
> org.apache.drill.exec.record.VectorContainer.getRecordCount():394
> org.apache.drill.exec.record.RecordBatchSizer.<init>():720
> org.apache.drill.exec.record.RecordBatchSizer.<init>():704
> 
> org.apache.drill.exec.physical.impl.common.HashTableTemplate$BatchHolder.getActualSize():462
> 
> org.apache.drill.exec.physical.impl.common.HashTableTemplate.getActualSize():964
> 
> org.apache.drill.exec.physical.impl.common.HashTableTemplate.makeDebugString():973
> 
> org.apache.drill.exec.physical.impl.common.HashPartition.makeDebugString():601
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.makeDebugString():1313
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.executeBuildPhase():1105
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.innerNext():525
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.record.AbstractRecordBatch.next():126
> org.apache.drill.exec.record.AbstractRecordBatch.next():116
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():63
> 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():141
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.record.AbstractRecordBatch.next():126
> org.apache.drill.exec.test.generated.HashAggregatorGen1068899.doWork():642
> org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.innerNext():296
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.record.AbstractRecordBatch.next():126
> org.apache.drill.exec.record.AbstractRecordBatch.next():116
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():63
> 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():141
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
> 
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():93
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():422
> org.apache.hadoop.security.UserGroupInformation.doAs():1669
> org.apache.drill.exec.work.fragment.FragmentExecutor.run():283
> org.apache.drill.common.SelfCleaningRunnable.run():38
> 

[jira] [Assigned] (DRILL-7170) IllegalStateException: Record count not set for this vector container

2019-10-04 Thread Boaz Ben-Zvi (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi reassigned DRILL-7170:
---

Assignee: Boaz Ben-Zvi

> IllegalStateException: Record count not set for this vector container
> -
>
> Key: DRILL-7170
> URL: https://issues.apache.org/jira/browse/DRILL-7170
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Reporter: Sorabh Hamirwasia
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.17.0
>
>
> {code:java}
> Query: 
> /root/drillAutomation/master/framework/resources/Advanced/tpcds/tpcds_sf1/original/maprdb/json/query95.sql
> WITH ws_wh AS
> (
> SELECT ws1.ws_order_number,
> ws1.ws_warehouse_sk wh1,
> ws2.ws_warehouse_sk wh2
> FROM   web_sales ws1,
> web_sales ws2
> WHERE  ws1.ws_order_number = ws2.ws_order_number
> AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
> SELECT
> Count(DISTINCT ws_order_number) AS `order count` ,
> Sum(ws_ext_ship_cost)   AS `total shipping cost` ,
> Sum(ws_net_profit)  AS `total net profit`
> FROM web_sales ws1 ,
> date_dim ,
> customer_address ,
> web_site
> WHERE d_date BETWEEN '2000-04-01' AND (
> Cast('2000-04-01' AS DATE) + INTERVAL '60' day)
> AND  ws1.ws_ship_date_sk = d_date_sk
> AND  ws1.ws_ship_addr_sk = ca_address_sk
> AND  ca_state = 'IN'
> AND  ws1.ws_web_site_sk = web_site_sk
> AND  web_company_name = 'pri'
> AND  ws1.ws_order_number IN
> (
> SELECT ws_order_number
> FROM   ws_wh)
> AND  ws1.ws_order_number IN
> (
> SELECT wr_order_number
> FROM   web_returns,
> ws_wh
> WHERE  wr_order_number = ws_wh.ws_order_number)
> ORDER BY count(DISTINCT ws_order_number)
> LIMIT 100
> Exception:
> java.sql.SQLException: SYSTEM ERROR: IllegalStateException: Record count not 
> set for this vector container
> Fragment 2:3
> Please, refer to logs for more information.
> [Error Id: 4ed92fce-505b-40ba-ac0e-4a302c28df47 on drill87:31010]
>   (java.lang.IllegalStateException) Record count not set for this vector 
> container
> 
> org.apache.drill.shaded.guava.com.google.common.base.Preconditions.checkState():459
> org.apache.drill.exec.record.VectorContainer.getRecordCount():394
> org.apache.drill.exec.record.RecordBatchSizer.<init>():720
> org.apache.drill.exec.record.RecordBatchSizer.<init>():704
> 
> org.apache.drill.exec.physical.impl.common.HashTableTemplate$BatchHolder.getActualSize():462
> 
> org.apache.drill.exec.physical.impl.common.HashTableTemplate.getActualSize():964
> 
> org.apache.drill.exec.physical.impl.common.HashTableTemplate.makeDebugString():973
> 
> org.apache.drill.exec.physical.impl.common.HashPartition.makeDebugString():601
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.makeDebugString():1313
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.executeBuildPhase():1105
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.innerNext():525
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.record.AbstractRecordBatch.next():126
> org.apache.drill.exec.record.AbstractRecordBatch.next():116
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():63
> 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():141
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.record.AbstractRecordBatch.next():126
> org.apache.drill.exec.test.generated.HashAggregatorGen1068899.doWork():642
> org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.innerNext():296
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.record.AbstractRecordBatch.next():126
> org.apache.drill.exec.record.AbstractRecordBatch.next():116
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():63
> 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():141
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
> 
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():93
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():422
> org.apache.hadoop.security.UserGroupInformation.doAs():1669
> org.apache.drill.exec.work.fragment.FragmentExecutor.run():283
> org.apache.drill.common.SelfCleaningRunnable.run():38
> 

[jira] [Commented] (DRILL-7170) IllegalStateException: Record count not set for this vector container

2019-10-04 Thread Boaz Ben-Zvi (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944813#comment-16944813
 ] 

Boaz Ben-Zvi commented on DRILL-7170:
-

Fixed the error (PR #1859 - commit ID d2645c7638a88a4afd162bc3f1e2d65353ca3a67).

However, note that the underlying OOM situation was not addressed!! (It will 
require another Jira.)

 

 

> IllegalStateException: Record count not set for this vector container
> -
>
> Key: DRILL-7170
> URL: https://issues.apache.org/jira/browse/DRILL-7170
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Reporter: Sorabh Hamirwasia
>Priority: Major
> Fix For: 1.17.0
>
>
> {code:java}
> Query: 
> /root/drillAutomation/master/framework/resources/Advanced/tpcds/tpcds_sf1/original/maprdb/json/query95.sql
> WITH ws_wh AS
> (
> SELECT ws1.ws_order_number,
> ws1.ws_warehouse_sk wh1,
> ws2.ws_warehouse_sk wh2
> FROM   web_sales ws1,
> web_sales ws2
> WHERE  ws1.ws_order_number = ws2.ws_order_number
> AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
> SELECT
> Count(DISTINCT ws_order_number) AS `order count` ,
> Sum(ws_ext_ship_cost)   AS `total shipping cost` ,
> Sum(ws_net_profit)  AS `total net profit`
> FROM web_sales ws1 ,
> date_dim ,
> customer_address ,
> web_site
> WHERE d_date BETWEEN '2000-04-01' AND (
> Cast('2000-04-01' AS DATE) + INTERVAL '60' day)
> AND  ws1.ws_ship_date_sk = d_date_sk
> AND  ws1.ws_ship_addr_sk = ca_address_sk
> AND  ca_state = 'IN'
> AND  ws1.ws_web_site_sk = web_site_sk
> AND  web_company_name = 'pri'
> AND  ws1.ws_order_number IN
> (
> SELECT ws_order_number
> FROM   ws_wh)
> AND  ws1.ws_order_number IN
> (
> SELECT wr_order_number
> FROM   web_returns,
> ws_wh
> WHERE  wr_order_number = ws_wh.ws_order_number)
> ORDER BY count(DISTINCT ws_order_number)
> LIMIT 100
> Exception:
> java.sql.SQLException: SYSTEM ERROR: IllegalStateException: Record count not 
> set for this vector container
> Fragment 2:3
> Please, refer to logs for more information.
> [Error Id: 4ed92fce-505b-40ba-ac0e-4a302c28df47 on drill87:31010]
>   (java.lang.IllegalStateException) Record count not set for this vector 
> container
> 
> org.apache.drill.shaded.guava.com.google.common.base.Preconditions.checkState():459
> org.apache.drill.exec.record.VectorContainer.getRecordCount():394
> org.apache.drill.exec.record.RecordBatchSizer.():720
> org.apache.drill.exec.record.RecordBatchSizer.():704
> 
> org.apache.drill.exec.physical.impl.common.HashTableTemplate$BatchHolder.getActualSize():462
> 
> org.apache.drill.exec.physical.impl.common.HashTableTemplate.getActualSize():964
> 
> org.apache.drill.exec.physical.impl.common.HashTableTemplate.makeDebugString():973
> 
> org.apache.drill.exec.physical.impl.common.HashPartition.makeDebugString():601
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.makeDebugString():1313
> 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.executeBuildPhase():1105
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.innerNext():525
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.record.AbstractRecordBatch.next():126
> org.apache.drill.exec.record.AbstractRecordBatch.next():116
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():63
> 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():141
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.record.AbstractRecordBatch.next():126
> org.apache.drill.exec.test.generated.HashAggregatorGen1068899.doWork():642
> org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.innerNext():296
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.record.AbstractRecordBatch.next():126
> org.apache.drill.exec.record.AbstractRecordBatch.next():116
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():63
> 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():141
> org.apache.drill.exec.record.AbstractRecordBatch.next():186
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
> 
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():93
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():422
> org.apache.hadoop.security.UserGroupInformation.doAs():1669
> 

[jira] [Commented] (DRILL-7170) IllegalStateException: Record count not set for this vector container

2019-09-09 Thread Boaz Ben-Zvi (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926184#comment-16926184
 ] 

Boaz Ben-Zvi commented on DRILL-7170:
-

    This failure happens at the end of the build phase for the Hash Join: after 
all the build-side input was read (and possibly some partitions have spilled), 
the hash-tables (and HJ helpers) are created last, one at a time, for each 
in-memory partition. As these creations consume more memory, we may run out of 
memory before all the in-memory partitions get their hash tables. So before 
each such partition is handled, the available memory is checked (using the 
postBuildCalc), and in case too little memory is left, this in-memory partition 
is spilled (to free memory for the hash tables of the other partitions).

    Looks like in this case the check ( {{postBuildCalc.shouldSpill()}} ) 
erroneously passed, so instead of spilling, the hash-table build began, but 
later OOMed. This OOM happened mid-work and left some vector container 
allocated but not initialized; thus the error-message code that tried to print 
relevant information tried getting the record count from that (uninitialized) 
container and failed.

   A possible *_work-around_*: Increase the "safety factor" of the memory 
calculator, thus triggering spills sooner and making it less likely that 
{{postBuildCalc.shouldSpill()}} returns {{false}}. The default setting is 
*1.0*; try values like *1.5*, *2.0*, etc. for the user configuration option, 
like: 

{{alter session set `exec.hashjoin.safety_factor` = 1.5}}

   The simplest *_code fix_* – catch this failure when the error message is 
prepared (and just print zero instead - around line 1105 in 
{{HashJoinBatch.java}}).

  Another fix - In {{getActualSize()}} in {{HashTableTemplate.java}} - just 
return zero for any batchHolder whose VectorContainer is not initialized (i.e. 
{{false == hasRecordCount()}}). (Seems that only the error-message code calls 
{{getActualSize()}}.)
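A sketch of that second fix (simplified; not the actual Drill code):
{code:java}
// Skip any batch holder whose container never had its record count set, so
// the debug path cannot trip the IllegalStateException.
long getActualSize(java.util.List<BatchHolderLike> batchHolders) {
  long size = 0;
  for (BatchHolderLike holder : batchHolders) {
    if (holder.hasRecordCount()) {   // uninitialized holders count as zero
      size += holder.getActualMemorySize();
    }
  }
  return size;
}

interface BatchHolderLike {  // stand-in for the real batch-holder type
  boolean hasRecordCount();
  long getActualMemorySize();
}
{code}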

   A more advanced fix (in addition to the above) – In the case of the above 
OOM, catch that OOM, then clean up the partially built hash-table (and helper), 
and last spill that whole partition (to free more memory). This is a workaround 
for the wrong choice made by {{postBuildCalc.shouldSpill()}}. But implementing 
this fix would require more testing.

 

 

> IllegalStateException: Record count not set for this vector container
> -
>
> Key: DRILL-7170
> URL: https://issues.apache.org/jira/browse/DRILL-7170
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Reporter: Sorabh Hamirwasia
>Priority: Major
> Fix For: 1.17.0
>
>
> {code:java}
> Query: 
> /root/drillAutomation/master/framework/resources/Advanced/tpcds/tpcds_sf1/original/maprdb/json/query95.sql
> WITH ws_wh AS
> (
> SELECT ws1.ws_order_number,
> ws1.ws_warehouse_sk wh1,
> ws2.ws_warehouse_sk wh2
> FROM   web_sales ws1,
> web_sales ws2
> WHERE  ws1.ws_order_number = ws2.ws_order_number
> AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
> SELECT
> Count(DISTINCT ws_order_number) AS `order count` ,
> Sum(ws_ext_ship_cost)   AS `total shipping cost` ,
> Sum(ws_net_profit)  AS `total net profit`
> FROM web_sales ws1 ,
> date_dim ,
> customer_address ,
> web_site
> WHERE d_date BETWEEN '2000-04-01' AND (
> Cast('2000-04-01' AS DATE) + INTERVAL '60' day)
> AND  ws1.ws_ship_date_sk = d_date_sk
> AND  ws1.ws_ship_addr_sk = ca_address_sk
> AND  ca_state = 'IN'
> AND  ws1.ws_web_site_sk = web_site_sk
> AND  web_company_name = 'pri'
> AND  ws1.ws_order_number IN
> (
> SELECT ws_order_number
> FROM   ws_wh)
> AND  ws1.ws_order_number IN
> (
> SELECT wr_order_number
> FROM   web_returns,
> ws_wh
> WHERE  wr_order_number = ws_wh.ws_order_number)
> ORDER BY count(DISTINCT ws_order_number)
> LIMIT 100
> Exception:
> java.sql.SQLException: SYSTEM ERROR: IllegalStateException: Record count not 
> set for this vector container
> Fragment 2:3
> Please, refer to logs for more information.
> [Error Id: 4ed92fce-505b-40ba-ac0e-4a302c28df47 on drill87:31010]
>   (java.lang.IllegalStateException) Record count not set for this vector 
> container
> 
> org.apache.drill.shaded.guava.com.google.common.base.Preconditions.checkState():459
> org.apache.drill.exec.record.VectorContainer.getRecordCount():394
> org.apache.drill.exec.record.RecordBatchSizer.<init>():720
> org.apache.drill.exec.record.RecordBatchSizer.<init>():704
> 
> org.apache.drill.exec.physical.impl.common.HashTableTemplate$BatchHolder.getActualSize():462
> 
> org.apache.drill.exec.physical.impl.common.HashTableTemplate.getActualSize():964
> 
> 

[jira] [Commented] (DRILL-7169) Rename drill-root ArtifactID to apache-drill

2019-05-10 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16837599#comment-16837599
 ] 

Boaz Ben-Zvi commented on DRILL-7169:
-

I removed the "ready to commit" label until we have more agreement on the dev 
list for making this change. 

 

> Rename drill-root ArtifactID to apache-drill
> 
>
> Key: DRILL-7169
> URL: https://issues.apache.org/jira/browse/DRILL-7169
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Tools, Build & Test
>Affects Versions: 1.15.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
>Priority: Minor
> Fix For: Future
>
>
> Rename the {{drill-root}} root POM ArtifactID to {{apache-drill}}, see:
> [https://github.com/apache/drill/blob/master/pom.xml#L32]
> Most Apache projects use a short project name as the artifactId.
> Renaming it to {{apache-drill}} allows using it as a variable for the Drill 
> build process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7169) Rename drill-root ArtifactID to apache-drill

2019-05-10 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-7169:

Labels:   (was: ready-to-commit)

> Rename drill-root ArtifactID to apache-drill
> 
>
> Key: DRILL-7169
> URL: https://issues.apache.org/jira/browse/DRILL-7169
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Tools, Build & Test
>Affects Versions: 1.15.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
>Priority: Minor
> Fix For: Future
>
>
> Rename the {{drill-root}} root POM ArtifactID to {{apache-drill}}; see:
> {{[https://github.com/apache/drill/blob/master/pom.xml#L32]}}
> Most Apache projects use the short project name as the artifactId.
> Renaming it to {{apache-drill}} would allow using it as a variable in the 
> Drill build process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7244) Run-time rowgroup pruning match() fails on casting a Long to an Integer

2019-05-07 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-7244:

Description: 
See DRILL-7240 , where a temporary workaround was created, skipping pruning 
(and logging) instead of this failure: 

After a Parquet table is refreshed with selected "interesting" columns, a query 
whose WHERE clause contains a condition on a "non interesting" INT64 column 
fails during run-time pruning (calling match()) with:
{noformat}
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
{noformat}
A long term solution is to pass the whole (or the relevant part of the) schema 
to the runtime, instead of just passing the "interesting" columns.

 

  was:
See -DRILL-7+240+-, where a temporary workaround was created, skipping pruning 
(and logging) instead of this failure: 

After a Parquet table is refreshed with selected "interesting" columns, a query 
whose WHERE clause contains a condition on a "non interesting" INT64 column 
fails during run-time pruning (calling match()) with:
{noformat}
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
{noformat}
A long term solution is to pass the whole (or the relevant part of the) schema 
to the runtime, instead of just passing the "interesting" columns.

 


> Run-time rowgroup pruning match() fails on casting a Long to an Integer
> ---
>
> Key: DRILL-7244
> URL: https://issues.apache.org/jira/browse/DRILL-7244
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Storage - Parquet
>Affects Versions: 1.17.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.17.0
>
>
> See DRILL-7240 , where a temporary workaround was created, skipping pruning 
> (and logging) instead of this failure: 
> After a Parquet table is refreshed with selected "interesting" columns, a 
> query whose WHERE clause contains a condition on a "non interesting" INT64 
> column fails during run-time pruning (calling match()) with:
> {noformat}
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
> {noformat}
> A long term solution is to pass the whole (or the relevant part of the) 
> schema to the runtime, instead of just passing the "interesting" columns.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7244) Run-time rowgroup pruning match() fails on casting a Long to an Integer

2019-05-07 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-7244:

Description: 
See -DRILL-7+240+-, where a temporary workaround was created, skipping pruning 
(and logging) instead of this failure: 

After a Parquet table is refreshed with selected "interesting" columns, a query 
whose WHERE clause contains a condition on a "non interesting" INT64 column 
fails during run-time pruning (calling match()) with:
{noformat}
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
{noformat}
A long term solution is to pass the whole (or the relevant part of the) schema 
to the runtime, instead of just passing the "interesting" columns.

 

  was:
See DRILL-7062, where a temporary workaround was created, skipping pruning (and 
logging) instead of this failure: 

After a Parquet table is refreshed with selected "interesting" columns, a query 
whose WHERE clause contains a condition on a "non interesting" INT64 column 
fails during run-time pruning (calling match()) with:
{noformat}
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
{noformat}
A long term solution is to pass the whole (or the relevant part of the) schema 
to the runtime, instead of just passing the "interesting" columns.

 


> Run-time rowgroup pruning match() fails on casting a Long to an Integer
> ---
>
> Key: DRILL-7244
> URL: https://issues.apache.org/jira/browse/DRILL-7244
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Storage - Parquet
>Affects Versions: 1.17.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.17.0
>
>
> See -DRILL-7+240+-, where a temporary workaround was created, skipping 
> pruning (and logging) instead of this failure: 
> After a Parquet table is refreshed with selected "interesting" columns, a 
> query whose WHERE clause contains a condition on a "non interesting" INT64 
> column fails during run-time pruning (calling match()) with:
> {noformat}
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
> {noformat}
> A long term solution is to pass the whole (or the relevant part of the) 
> schema to the runtime, instead of just passing the "interesting" columns.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7244) Run-time rowgroup pruning match() fails on casting a Long to an Integer

2019-05-07 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-7244:

Issue Type: Sub-task  (was: Bug)
Parent: DRILL-7028

> Run-time rowgroup pruning match() fails on casting a Long to an Integer
> ---
>
> Key: DRILL-7244
> URL: https://issues.apache.org/jira/browse/DRILL-7244
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Storage - Parquet
>Affects Versions: 1.17.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.17.0
>
>
> See DRILL-7062, where a temporary workaround was created, skipping pruning 
> (and logging) instead of this failure: 
> After a Parquet table is refreshed with selected "interesting" columns, a 
> query whose WHERE clause contains a condition on a "non interesting" INT64 
> column fails during run-time pruning (calling match()) with:
> {noformat}
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
> {noformat}
> A long term solution is to pass the whole (or the relevant part of the) 
> schema to the runtime, instead of just passing the "interesting" columns.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7244) Run-time rowgroup pruning match() fails on casting a Long to an Integer

2019-05-07 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-7244:
---

 Summary: Run-time rowgroup pruning match() fails on casting a Long 
to an Integer
 Key: DRILL-7244
 URL: https://issues.apache.org/jira/browse/DRILL-7244
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Parquet
Affects Versions: 1.17.0
Reporter: Boaz Ben-Zvi
Assignee: Boaz Ben-Zvi
 Fix For: 1.17.0


See DRILL-7062, where a temporary workaround was created, skipping pruning (and 
logging) instead of this failure: 

After a Parquet table is refreshed with selected "interesting" columns, a query 
whose WHERE clause contains a condition on a "non interesting" INT64 column 
fails during run-time pruning (calling match()) with:
{noformat}
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
{noformat}
A long term solution is to pass the whole (or the relevant part of the) schema 
to the runtime, instead of just passing the "interesting" columns.
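One defensive direction for the underlying type mismatch (a hedged, 
self-contained illustration with made-up names, not the committed fix) is to 
read the boxed statistic through Number and widen it, so an Integer-typed 
schema and a Long-valued statistic compare safely:
{code:java}
// Hedged illustration only: widen via Number instead of casting the boxed value.
public class WidenStat {
  static long statAsLong(Object boxedStat) {
    return ((Number) boxedStat).longValue(); // works for Integer and Long alike
  }

  public static void main(String[] args) {
    System.out.println(statAsLong(Integer.valueOf(7))); // 7
    System.out.println(statAsLong(Long.valueOf(7L)));   // 7
  }
}
{code}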

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7240) Temp fix: Run-time rowgroup pruning match() fails on casting a Long to an Integer

2019-05-07 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-7240:

Summary: Temp fix: Run-time rowgroup pruning match() fails on casting a 
Long to an Integer  (was: Run-time rowgroup pruning match() fails on casting a 
Long to an Integer)

> Temp fix: Run-time rowgroup pruning match() fails on casting a Long to an 
> Integer
> -
>
> Key: DRILL-7240
> URL: https://issues.apache.org/jira/browse/DRILL-7240
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Storage - Parquet
>Affects Versions: 1.17.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> After a Parquet table is refreshed with selected "interesting" columns, a 
> query whose WHERE clause contains a condition on a "non interesting" INT64 
> column fails during run-time pruning (calling match()) with:
> {noformat}
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
> {noformat}
>  Near-term fix suggestion: Catch the match() exception error, and instead do 
> not prune (i.e. run-time pruning would be disabled in such cases).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7240) Run-time rowgroup pruning match() fails on casting a Long to an Integer

2019-05-07 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-7240:

Labels: ready-to-commit  (was: )

> Run-time rowgroup pruning match() fails on casting a Long to an Integer
> ---
>
> Key: DRILL-7240
> URL: https://issues.apache.org/jira/browse/DRILL-7240
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Storage - Parquet
>Affects Versions: 1.17.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> After a Parquet table is refreshed with selected "interesting" columns, a 
> query whose WHERE clause contains a condition on a "non interesting" INT64 
> column fails during run-time pruning (calling match()) with:
> {noformat}
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
> {noformat}
>  Near-term fix suggestion: Catch the match() exception error, and instead do 
> not prune (i.e. run-time pruning would be disabled in such cases).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7240) Run-time rowgroup pruning match() fails on casting a Long to an Integer

2019-05-03 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832964#comment-16832964
 ] 

Boaz Ben-Zvi commented on DRILL-7240:
-

Example recreating this bug - take this json file
{noformat}
{"key": "aa", "sales": 11}
{"key": "bb", "sales": 22}
{noformat}
And create two parquet tables/files by selecting from the json, first casting 
the "sales" to an INT, and the second to a BIGINT:
{noformat}
create table test_int as select key, cast(sales as int) sales from 
dfs.`/tmp/myfile.json`;
create table test_bigint as select key, cast(sales as bigint) sales from 
dfs.`/tmp/myfile.json`;
{noformat}
Then move the two files into a sub-directory, renaming the second:
{noformat}
$ > mv /tmp/test_int/0_0_0.parquet /tmp/test/sub
$ > mv /tmp/test_bigint/0_0_0.parquet /tmp/test/sub/0_0_1.parquet 
{noformat}
Lastly, refresh the metadata on only the "key" column, then run a query with a 
predicate on the "sales" column:
{noformat}
refresh table METADATA columns(key) dfs.`/tmp/test`;
select sales from dfs.`/tmp/test/` where sales > 10;
{noformat}


> Run-time rowgroup pruning match() fails on casting a Long to an Integer
> ---
>
> Key: DRILL-7240
> URL: https://issues.apache.org/jira/browse/DRILL-7240
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Storage - Parquet
>Affects Versions: 1.17.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.17.0
>
>
> After a Parquet table is refreshed with selected "interesting" columns, a 
> query whose WHERE clause contains a condition on a "non interesting" INT64 
> column fails during run-time pruning (calling match()) with:
> {noformat}
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
> {noformat}
>  Near-term fix suggestion: Catch the match() exception error, and instead do 
> not prune (i.e. run-time pruning would be disabled in such cases).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7240) Run-time rowgroup pruning match() fails on casting a Long to an Integer

2019-05-03 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-7240:

Description: 
After a Parquet table is refreshed with selected "interesting" columns, a query 
whose WHERE clause contains a condition on a "non interesting" INT64 column 
fails during run-time pruning (calling match()) with:
{noformat}
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
{noformat}
 Near-term fix suggestion: catch the match() exception, and instead do 
not prune (i.e., run-time pruning would be disabled in such cases).

  was:
After a Parquet table is refreshed with select "interesting" columns, a query 
whose WHERE clause contains a condition on a "non interesting" INT64 column 
fails during run-time pruning (calling match()) with:
{noformat}
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
{noformat}
 Near-term fix suggestion: catch the match() exception, and instead do 
not prune (i.e., run-time pruning would be disabled in such cases).


> Run-time rowgroup pruning match() fails on casting a Long to an Integer
> ---
>
> Key: DRILL-7240
> URL: https://issues.apache.org/jira/browse/DRILL-7240
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Storage - Parquet
>Affects Versions: 1.17.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.17.0
>
>
> After a Parquet table is refreshed with selected "interesting" columns, a 
> query whose WHERE clause contains a condition on a "non interesting" INT64 
> column fails during run-time pruning (calling match()) with:
> {noformat}
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
> {noformat}
>  Near-term fix suggestion: Catch the match() exception error, and instead do 
> not prune (i.e. run-time pruning would be disabled in such cases).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7240) Run-time rowgroup pruning match() fails on casting a Long to an Integer

2019-05-03 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-7240:

Issue Type: Sub-task  (was: Bug)
Parent: DRILL-7028

> Run-time rowgroup pruning match() fails on casting a Long to an Integer
> ---
>
> Key: DRILL-7240
> URL: https://issues.apache.org/jira/browse/DRILL-7240
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Storage - Parquet
>Affects Versions: 1.17.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.17.0
>
>
> After a Parquet table is refreshed with select "interesting" columns, a query 
> whose WHERE clause contains a condition on a "non interesting" INT64 column 
> fails during run-time pruning (calling match()) with:
> {noformat}
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
> {noformat}
>  Near-term fix suggestion: Catch the match() exception error, and instead do 
> not prune (i.e. run-time pruning would be disabled in such cases).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7240) Run-time rowgroup pruning match() fails on casting a Long to an Integer

2019-05-03 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-7240:
---

 Summary: Run-time rowgroup pruning match() fails on casting a Long 
to an Integer
 Key: DRILL-7240
 URL: https://issues.apache.org/jira/browse/DRILL-7240
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Parquet
Affects Versions: 1.17.0
Reporter: Boaz Ben-Zvi
Assignee: Boaz Ben-Zvi
 Fix For: 1.17.0


After a Parquet table is refreshed with select "interesting" columns, a query 
whose WHERE clause contains a condition on a "non interesting" INT64 column 
fails during run-time pruning (calling match()) with:
{noformat}
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
{noformat}
 Near-term fix suggestion: catch the match() exception, and instead do 
not prune (i.e., run-time pruning would be disabled in such cases).
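A minimal self-contained sketch of that near-term suggestion (illustrative 
names only; not the actual Drill patch) - treat a ClassCastException from 
match() as "cannot prune" and keep the row group:
{code:java}
// Hedged sketch; names are illustrative, this is not the committed patch.
public class PruneSafely {
  interface Matcher { boolean match(Object stats); }

  /** Returns true when the row group must be scanned (i.e., not pruned). */
  static boolean keepRowGroup(Matcher predicate, Object stats) {
    try {
      return predicate.match(stats);
    } catch (ClassCastException e) {
      // Fail-safe: skip pruning for this row group instead of failing the
      // query; scanning a prunable row group is merely slower, never wrong.
      return true;
    }
  }

  public static void main(String[] args) {
    Matcher intOnly = stats -> (Integer) stats > 10; // Long stats throw CCE
    System.out.println(keepRowGroup(intOnly, 11));   // true: matched normally
    System.out.println(keepRowGroup(intOnly, 11L));  // true: CCE caught, pruning skipped
  }
}
{code}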



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (DRILL-7223) Make the timeout in TimedCallable a configurable boot time parameter

2019-04-29 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi reassigned DRILL-7223:
---

Assignee: Boaz Ben-Zvi

> Make the timeout in TimedCallable a configurable boot time parameter
> 
>
> Key: DRILL-7223
> URL: https://issues.apache.org/jira/browse/DRILL-7223
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.16.0
>Reporter: Aman Sinha
>Assignee: Boaz Ben-Zvi
>Priority: Minor
> Fix For: 1.17.0
>
>
> The 
> [TimedCallable.TIMEOUT_PER_RUNNABLE_IN_MSECS|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/TimedCallable.java#L52]
>  is currently an internal Drill constant defined as 15 secs. This has been 
> there from day 1 of the introduction. Drill's TimedCallable implements the 
> Java concurrency's Callable interface to create timed threads. It is used by 
> the REFRESH METADATA command which creates multiple threads on the Foreman 
> node to gather Parquet metadata to build the metadata cache.
> Depending on the load on the system, or for a very large number of parquet 
> files (millions), it is possible to exceed this timeout. While the exact root 
> cause of exceeding the timeout is being investigated, it makes sense to make 
> this timeout a configurable parameter to aid with large-scale testing. This 
> JIRA is to make it a configurable bootstrapping option in 
> drill-override.conf.
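A hedged sketch of what such a boot-time option could look like (the config 
key name below is hypothetical; Drill's boot config is HOCON read via 
Typesafe Config):
{code:java}
// Hedged sketch (hypothetical key name, not the committed option):
// read the timeout once from boot-time config instead of a hard-coded constant.
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class TimedCallableTimeout {
  // Would live in drill-override.conf, e.g.:
  //   drill.exec.store.timed-callable.timeout-ms: 15000
  static final String TIMEOUT_KEY = "drill.exec.store.timed-callable.timeout-ms";

  public static void main(String[] args) {
    Config config = ConfigFactory.parseString(TIMEOUT_KEY + ": 15000");
    long timeoutMs = config.getLong(TIMEOUT_KEY);
    System.out.println("TimedCallable timeout = " + timeoutMs + " ms");
  }
}
{code}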



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7062) Run-time row group pruning

2019-04-22 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-7062:

Labels: ready-to-commit  (was: )

> Run-time row group pruning
> --
>
> Key: DRILL-7062
> URL: https://issues.apache.org/jira/browse/DRILL-7062
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Metadata
>Reporter: Venkata Jyothsna Donapati
>Assignee: Boaz Ben-Zvi
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7173) Analyze table may fail when prefer_plain_java is set to true on codegen for resetValues

2019-04-11 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-7173:
---

 Summary: Analyze table may fail when prefer_plain_java is set to 
true on codegen for resetValues 
 Key: DRILL-7173
 URL: https://issues.apache.org/jira/browse/DRILL-7173
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Codegen
Affects Versions: 1.15.0
 Environment: *prefer_plain_java: true*

 
Reporter: Boaz Ben-Zvi
 Fix For: 1.17.0


  The *prefer_plain_java* compile option is useful for debugging generated 
code (it can be set in drill-override.conf; the default value is false). When 
set to true, some "analyze table" calls generate code that fails, due to the 
addition of a SchemaChangeException which is not declared in the Streaming-Aggr 
template.

For example:
{noformat}
apache drill (dfs.tmp)> create table lineitem3 as select * from 
cp.`tpch/lineitem.parquet`;
+--+---+
| Fragment | Number of records written |
+--+---+
| 0_0 | 60175 |
+--+---+
1 row selected (2.06 seconds)
apache drill (dfs.tmp)> analyze table lineitem3 compute statistics;
Error: SYSTEM ERROR: CompileException: File 
'org.apache.drill.exec.compile.DrillJavaFileObject[StreamingAggregatorGen4.java]',
 Line 7869, Column 20: StreamingAggregatorGen4.java:7869: error: resetValues() 
in org.apache.drill.exec.test.generated.StreamingAggregatorGen4 cannot override 
resetValues() in 
org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate
 public boolean resetValues()
 ^
 overridden method does not throw 
org.apache.drill.exec.exception.SchemaChangeException 
(compiler.err.override.meth.doesnt.throw)
{noformat}
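The failure is the standard Java overriding rule: an overriding method may not 
declare checked exceptions that the overridden method does not. A 
self-contained sketch of the likely remedy (illustrative class names; not the 
committed patch) - declare the exception on the template method so generated 
plain-Java subclasses may declare it too:
{code:java}
// Hedged sketch of the overriding rule and the signature change that avoids it.
class SchemaChangeException extends Exception { }

abstract class StreamingAggTemplateSketch {
  // Adding "throws" here lets generated subclasses declare it as well.
  public boolean resetValues() throws SchemaChangeException { return true; }
}

class GeneratedAggSketch extends StreamingAggTemplateSketch {
  @Override
  public boolean resetValues() throws SchemaChangeException {
    // generated code may legitimately need to signal a schema change
    return false;
  }
}

public class OverrideDemo {
  public static void main(String[] args) throws SchemaChangeException {
    System.out.println(new GeneratedAggSketch().resetValues()); // false
  }
}
{code}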
 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7091) Query with EXISTS and correlated subquery fails with NPE in HashJoinMemoryCalculatorImpl$BuildSidePartitioningImpl

2019-03-11 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790186#comment-16790186
 ] 

Boaz Ben-Zvi commented on DRILL-7091:
-

The problem may be unrelated to the Hash-Join. Looks like the Hash-Join merely 
detects a missing part of the schema of the incoming batch from the Project 
operator.

The plan has a Project under the right side of the Hash-Join:
{noformat}
 Project(T5¦¦**=[$0], $f1=[ITEM($0, 'n_regionkey')]) : rowType = 
RecordType(DYNAMIC_STAR T5¦¦**, ANY $f1)
{noformat}
Then the Hash-Join tries to match "$f1" with the incoming schema, which +has 
only the first part+: "T5¦¦**".

Forcing the Project to be pushed into the scan (by commenting out the "return;" 
on line 110 in *DrillPushProjectIntoScanRule()*) "fixed" the schema of the 
Project batch to include both fields.

So somehow the Project only gets the first field expression.

 

> Query with EXISTS and correlated subquery fails with NPE in 
> HashJoinMemoryCalculatorImpl$BuildSidePartitioningImpl
> --
>
> Key: DRILL-7091
> URL: https://issues.apache.org/jira/browse/DRILL-7091
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Boaz Ben-Zvi
>Priority: Major
>
> Steps to reproduce:
> 1. Create view:
> {code:sql}
> create view dfs.tmp.nation_view as select * from cp.`tpch/nation.parquet`;
> {code}
> Run the following query:
> {code:sql}
> SELECT n_nationkey, n_name
> FROM  dfs.tmp.nation_view a
> WHERE EXISTS (SELECT 1
> FROM cp.`tpch/region.parquet` b
> WHERE b.r_regionkey =  a.n_regionkey)
> {code}
> This query fails with NPE:
> {noformat}
> [Error Id: 9a592635-f792-4403-965c-bd2eece7e8fc on cv1:31010]
>   at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:633)
>  ~[drill-common-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:364)
>  [drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:219)
>  [drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:330)
>  [drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [na:1.8.0_161]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [na:1.8.0_161]
>   at java.lang.Thread.run(Thread.java:748) [na:1.8.0_161]
> Caused by: java.lang.NullPointerException: null
>   at 
> org.apache.drill.exec.physical.impl.join.HashJoinMemoryCalculatorImpl$BuildSidePartitioningImpl.initialize(HashJoinMemoryCalculatorImpl.java:267)
>  ~[drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.executeBuildPhase(HashJoinBatch.java:959)
>  ~[drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.innerNext(HashJoinBatch.java:525)
>  ~[drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:186)
>  ~[drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:126)
>  ~[drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:116)
>  ~[drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:63)
>  ~[drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:141)
>  ~[drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:186)
>  ~[drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:126)
>  ~[drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.test.generated.HashAggregatorGen2.doWork(HashAggTemplate.java:642)
>  ~[na:na]
>   at 
> org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.innerNext(HashAggBatch.java:295)
>  

[jira] [Created] (DRILL-7069) Poor performance of transformBinaryInMetadataCache

2019-02-28 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-7069:
---

 Summary: Poor performance of transformBinaryInMetadataCache
 Key: DRILL-7069
 URL: https://issues.apache.org/jira/browse/DRILL-7069
 Project: Apache Drill
  Issue Type: Improvement
  Components: Metadata
Affects Versions: 1.15.0
Reporter: Boaz Ben-Zvi
Assignee: Boaz Ben-Zvi
 Fix For: 1.16.0


The performance of the method *transformBinaryInMetadataCache* scales poorly as 
the table's number of underlying files, row-groups and columns grows. This 
method is invoked during the planning of every query that uses the table.

     A test on a table using 219 directories (each with 20 files), 1 row-group 
in each file, and 94 columns, measured about *1340 milliseconds*.

    The main culprit is the version checks, which take place in *every 
iteration* (i.e., about 400k times in the previous example) and involve the 
construction of 6 MetadataVersion objects each time (and possibly garbage 
collection).

     Removing the version checks from the loops improved this method's 
performance on the above test down to about *250 milliseconds*.
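The idea can be shown with a minimal self-contained sketch (made-up names; not 
the actual Drill patch): compute the version comparison once, outside the 
loops, instead of constructing version objects on every iteration:
{code:java}
// Hedged illustration of hoisting the invariant version check out of the loop.
import java.util.List;

public class HoistVersionCheck {
  static final class Version implements Comparable<Version> {
    final int major, minor;
    Version(int major, int minor) { this.major = major; this.minor = minor; }
    @Override
    public int compareTo(Version o) {
      return major != o.major ? Integer.compare(major, o.major)
                              : Integer.compare(minor, o.minor);
    }
  }

  static int transformAll(List<String> rowGroups, Version cacheVersion) {
    // Hoisted: one comparison instead of one pair of objects per iteration.
    boolean needsTransform = cacheVersion.compareTo(new Version(3, 3)) >= 0;
    int transformed = 0;
    for (String rowGroup : rowGroups) {
      if (needsTransform) {
        transformed++; // stand-in for the real binary-to-string transform
      }
    }
    return transformed;
  }

  public static void main(String[] args) {
    System.out.println(transformAll(List.of("rg1", "rg2"), new Version(3, 3))); // 2
  }
}
{code}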

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6799) Enhance the Hash-Join Operator to perform Anti-Semi-Join

2019-02-25 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6799:

Fix Version/s: (was: 1.16.0)
   1.17.0

> Enhance the Hash-Join Operator to perform Anti-Semi-Join
> 
>
> Key: DRILL-6799
> URL: https://issues.apache.org/jira/browse/DRILL-6799
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators, Query Planning & 
> Optimization
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Minor
> Fix For: 1.17.0
>
>
> Similar to handling Semi-Join (see DRILL-6735), the Anti-Semi-Join can be 
> enhanced by eliminating the extra DISTINCT (i.e. Hash-Aggr) operator.
> Example (note the NOT IN):
> select c.c_first_name, c.c_last_name from dfs.`/data/json/s1/customer` c 
> where c.c_customer_sk NOT IN (select s.ss_customer_sk from 
> dfs.`/data/json/s1/store_sales` s) limit 4;



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6767) Simplify transfer of information from the planner to the operators

2019-02-25 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6767:

Fix Version/s: (was: 1.16.0)
   1.17.0

> Simplify transfer of information from the planner to the operators
> --
>
> Key: DRILL-6767
> URL: https://issues.apache.org/jira/browse/DRILL-6767
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators, Query Planning & 
> Optimization
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Minor
> Fix For: 1.17.0
>
>
> Currently little specific information known to the planner is passed to the 
> operators. For example, see the `joinType` parameter passed to the Join 
> operators (specifying whether this is a LEFT, RIGHT, INNER or FULL join). 
>  The relevant code passes this information explicitly via the constructors' 
> signature (e.g., see HashJoinPOP, AbstractJoinPop, etc), and uses specific 
> fields for this information, and affects all the test code using it, etc.
>  In the near future many more such "pieces of information" will possibly be 
> added to Drill, including:
>  (1) Is this a Semi (or Anti-Semi) join.
>  (2) `joinControl`
>  (3) `isRowKeyJoin`
>  (4) `isBroadcastJoin`
>  (5) Which join columns are not needed (DRILL-6758)
>  (6) Is this operator positioned between Lateral and UnNest.
>  (7) For Hash-Agg: Which phase (already implemented).
>  (8) For Hash-Agg: Perform COUNT  (DRILL-6836) 
> Each addition of such information would require a significant code change, 
> and add some code clutter.
> *Suggestion*: Instead pass a single object containing all the needed planner 
> information. So the next time another field is added, only that object needs 
> to be changed. (Ideally the whole plan could be passed, and then each 
> operator could poke and pick its needed fields)
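A hedged sketch of that suggestion (all names hypothetical): one immutable 
carrier object the planner fills once and every operator constructor accepts, 
so a new piece of information means one new field instead of N changed 
signatures:
{code:java}
// Hedged sketch of the suggested "single object" carrier; names are made up.
public class PlannerHints {
  public enum JoinKind { INNER, LEFT, RIGHT, FULL, SEMI, ANTI_SEMI }

  private final JoinKind joinKind;
  private final boolean rowKeyJoin;
  private final boolean broadcastJoin;

  public PlannerHints(JoinKind joinKind, boolean rowKeyJoin, boolean broadcastJoin) {
    this.joinKind = joinKind;
    this.rowKeyJoin = rowKeyJoin;
    this.broadcastJoin = broadcastJoin;
  }

  public JoinKind joinKind()    { return joinKind; }
  public boolean isRowKeyJoin() { return rowKeyJoin; }
  public boolean isBroadcast()  { return broadcastJoin; }

  public static void main(String[] args) {
    // An operator would receive one PlannerHints instead of N flags:
    PlannerHints hints = new PlannerHints(JoinKind.SEMI, false, true);
    System.out.println(hints.joinKind() + ", broadcast=" + hints.isBroadcast());
  }
}
{code}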



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6949) Query fails with "UNSUPPORTED_OPERATION ERROR: Hash-Join can not partition the inner data any further" when Semi join is enabled

2019-02-25 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6949:

Fix Version/s: (was: 1.16.0)
   1.17.0

> Query fails with "UNSUPPORTED_OPERATION ERROR: Hash-Join can not partition 
> the inner data any further" when Semi join is enabled
> 
>
> Key: DRILL-6949
> URL: https://issues.apache.org/jira/browse/DRILL-6949
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Affects Versions: 1.15.0
>Reporter: Abhishek Ravi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.17.0
>
> Attachments: 23cc1240-74ff-a0c0-8cd5-938fc136e4e2.sys.drill, 
> 23cc1369-0812-63ce-1861-872636571437.sys.drill
>
>
> The following query fails with *Error: UNSUPPORTED_OPERATION ERROR: 
> Hash-Join can not partition the inner data any further (probably due to too 
> many join-key duplicates)* on TPC-H SF100 data.
> {code:sql}
> set `exec.hashjoin.enable.runtime_filter` = true;
> set `exec.hashjoin.runtime_filter.max.waiting.time` = 1;
> set `planner.enable_broadcast_join` = false;
> select
>  count(*)
> from
>  lineitem l1
> where
>  l1.l_discount IN (
>  select
>  distinct(cast(l2.l_discount as double))
>  from
>  lineitem l2);
> reset `exec.hashjoin.enable.runtime_filter`;
> reset `exec.hashjoin.runtime_filter.max.waiting.time`;
> reset `planner.enable_broadcast_join`;
> {code}
> The subquery contains the *distinct* keyword, and hence there should not be 
> duplicate values. 
> I suspect that the failure is caused by semijoin because the query succeeds 
> when semijoin is disabled explicitly.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6938) SQL get the wrong result after hashjoin and hashagg disabled

2019-02-25 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6938:

Fix Version/s: (was: 1.16.0)
   1.17.0

> SQL get the wrong result after hashjoin and hashagg disabled
> 
>
> Key: DRILL-6938
> URL: https://issues.apache.org/jira/browse/DRILL-6938
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Dony Dong
>Assignee: Boaz Ben-Zvi
>Priority: Critical
> Fix For: 1.17.0
>
>
> Hi Team
> After we disabled hashjoin and hashagg to fix an out-of-memory issue, we got 
> the wrong result.
> With these two options enabled, we get 8 rows. After we disable them, the 
> query returns only 3 rows. It seems some MEM_ID values were excluded before 
> the group-by or some other step.
> select b.MEM_ID,count(distinct b.DEP_NO)
> from dfs.test.emp b
> where b.DEP_NO<>'-'
> and b.MEM_ID in ('68','412','852','117','657','816','135','751')
> and b.HIRE_DATE>'2014-06-01'
> group by b.MEM_ID
> order by 1;



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6836) Eliminate StreamingAggr for COUNT DISTINCT

2019-02-25 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6836:

Fix Version/s: (was: 1.16.0)
   1.17.0

> Eliminate StreamingAggr for COUNT DISTINCT
> --
>
> Key: DRILL-6836
> URL: https://issues.apache.org/jira/browse/DRILL-6836
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators, Query Planning & 
> Optimization
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Minor
> Fix For: 1.17.0
>
>
> The COUNT DISTINCT operation is often implemented with a Hash-Aggr operator 
> for the DISTINCT, and a Streaming-Aggr above to perform the COUNT.  That 
> Streaming-Aggr does the counting like any aggregation, counting each value, 
> batch after batch.
>   While very efficient, that counting work is basically not needed, as the 
> Hash-Aggr knows the number of distinct values (in the in-memory partitions).
>   Hence _a possible small performance improvement_ - eliminate the 
> Streaming-Aggr operator, and notify the Hash-Aggr to return a COUNT (these 
> are Planner changes). The Hash-Aggr operator would need to generate the 
> single Float8 column output schema, and output that batch with a single 
> value, just like the Streaming-Aggr did (likely without generating code).
>   In case of a spill, the Hash-Aggr still needs to read and process those 
> partitions, to get the exact distinct number.
>    The expected improvement is the elimination of the batch by batch output 
> from the Hash-Aggr, and the batch by batch, row by row processing of the 
> Streaming-Aggr.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-4667) Improve memory footprint of broadcast joins

2019-02-25 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-4667:

Fix Version/s: (was: 1.16.0)
   1.17.0

> Improve memory footprint of broadcast joins
> ---
>
> Key: DRILL-4667
> URL: https://issues.apache.org/jira/browse/DRILL-4667
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.6.0
>Reporter: Aman Sinha
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.17.0
>
>
> For broadcast joins, currently Drill optimizes the data transfer across the 
> network for broadcast table by sending a single copy to the receiving node 
> which then distributes it to all minor fragments running on that particular 
> node.  However, each minor fragment builds its own hash table (for a hash 
> join) using this broadcast table.  We can substantially improve the memory 
> footprint by having a shared copy of the hash table among multiple minor 
> fragments on a node.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7025) Running Drill unit tests from within IntelliJ IDEA fails with FileNotFoundException: Source './src/test/resources/tpchmulti' does not exist

2019-02-20 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773371#comment-16773371
 ] 

Boaz Ben-Zvi commented on DRILL-7025:
-

_TestExampleQueries_  works for me as well with no special setting. 
[~vladimirsitnikov] Some setting in your configuration may be "unusual".  For 
me – under  "Run->Edit Configurations" the "Working Directory" is  
{{$MODULE_DIR$}} , and the "Use classpath of module" is selected as 
{{drill-java-exec}} .

> Running Drill unit tests from within IntelliJ IDEA fails with 
> FileNotFoundException: Source './src/test/resources/tpchmulti' does not exist
> ---
>
> Key: DRILL-7025
> URL: https://issues.apache.org/jira/browse/DRILL-7025
> Project: Apache Drill
>  Issue Type: Bug
>  Components:  Server, Tools, Build & Test
>Affects Versions: 1.15.0
>Reporter: Vladimir Sitnikov
>Priority: Major
>
> I start org.apache.drill.TestExampleQueries via the regular "run tests" 
> button, and it produces the following exception:
> {noformat}
> java.lang.RuntimeException: This should not happen
> at 
> org.apache.drill.test.BaseDirTestWatcher.copyTo(BaseDirTestWatcher.java:298)
> at 
> org.apache.drill.test.BaseDirTestWatcher.copyResourceToRoot(BaseDirTestWatcher.java:223)
> at 
> org.apache.drill.TestExampleQueries.setupTestFiles(TestExampleQueries.java:42)
> Caused by: java.io.FileNotFoundException: Source 
> './src/test/resources/tpchmulti' does not exist
> at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:1074)
> at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:1038)
> at 
> org.apache.drill.test.BaseDirTestWatcher.copyTo(BaseDirTestWatcher.java:296)
> ... 2 more{noformat}
> In fact IDEA creates "run configuration" that uses 
> "/Users/vladimirsitnikov/Documents/work/drill" as "Working directory".
> That directory is a root where all drill sources are located (i.e. I have 
> /Users/vladimirsitnikov/Documents/work/drill/exec, 
> /Users/vladimirsitnikov/Documents/work/drill/exec/java-exec and so on).
> It looks like java-exec tests assume the working directory is set to 
> /Users/vladimirsitnikov/Documents/work/drill/exec/java-exec, however that is 
> not the case when individual tests are run.
> The workaround is to add "exec/java-exec" to the "working directory".
> It would be so much better if tests could be run from both working 
> directories.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7043) Enhance Merge-Join to support Full Outer Join

2019-02-19 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772590#comment-16772590
 ] 

Boaz Ben-Zvi commented on DRILL-7043:
-

This enhancement is becoming more useful as our storage begins to support 
"sortedness" - e.g., Secondary Indexes, and future Parquet Metadata (e.g., 
taken from Hive).

 

 

> Enhance Merge-Join to support Full Outer Join
> -
>
> Key: DRILL-7043
> URL: https://issues.apache.org/jira/browse/DRILL-7043
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators, Query Planning & 
> Optimization
>Affects Versions: 1.15.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Major
>
>    Currently the Merge Join operator internally cannot support a Right Outer 
> Join (and thus a Full Outer Join; for ROJ alone, the planner rotates the 
> inputs and specifies a Left Outer Join).
>    The actual reason for not supporting ROJ is the current MJ implementation 
> - when a match is found, it puts a mark on the right side and iterates down 
> on the right, resetting back at the end (and on to the next left side entry). 
>  This would create an ambiguity if the next left entry is bigger than the 
> previous - is this an unmatched (i.e., need to return the right entry), or 
> there was a prior match (i.e., just advance to the next right).
>    Seems that adding a relevant flag to the persisted state ({{status}}) and 
> some other code changes would make the operator support Right-Outer-Join as 
> well (and thus a Full Outer Join).  The planner needs an update as well - to 
> suggest the MJ in case of a FOJ, and maybe not to rotate the inputs in some 
> MJ cases.
>    Currently trying a FOJ with MJ (i.e. HJ disabled) produces the following 
> "no plan found" from Calcite:
> {noformat}
> 0: jdbc:drill:zk=local> select * from temp t1 full outer join temp2 t2 on 
> t1.d_date = t2.d_date;
> Error: SYSTEM ERROR: CannotPlanException: Node 
> [rel#2804:Subset#8.PHYSICAL.SINGLETON([]).[]] could not be implemented; 
> planner state:
> Root: rel#2804:Subset#8.PHYSICAL.SINGLETON([]).[]
> Original rel:
> DrillScreenRel(subset=[rel#2804:Subset#8.PHYSICAL.SINGLETON([]).[]]): 
> rowcount = 6.0, cumulative cost = {0.6001 rows, 
> 0.6001 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2802
>   DrillProjectRel(subset=[rel#2801:Subset#7.LOGICAL.ANY([]).[]], **=[$0], 
> **0=[$2]): rowcount = 6.0, cumulative cost = {6.0 rows, 12.0 cpu, 0.0 io, 0.0 
> network, 0.0 memory}, id = 2800
> DrillJoinRel(subset=[rel#2799:Subset#6.LOGICAL.ANY([]).[]], 
> condition=[=($1, $3)], joinType=[full]): rowcount = 6.0, cumulative cost = 
> {10.0 rows, 104.0 cpu, 0.0 io, 0.0 network, 70.4 memory}, id = 2798
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (DRILL-7043) Enhance Merge-Join to support Full Outer Join

2019-02-19 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772590#comment-16772590
 ] 

Boaz Ben-Zvi edited comment on DRILL-7043 at 2/20/19 3:32 AM:
--

This enhancement is becoming more useful as our storage begins to support 
"sortedness" - e.g., Secondary Indexes, and future Parquet Metadata (e.g., 
taken from Hive). A Merge-Join on two sorted tables always out-performs a 
Hash-Join.

 

 

 


was (Author: ben-zvi):
This enhancement is becoming more useful as our storage begins to support 
"sortedness" - e.g., Secondary Indexes, and future Parquet Metadata (e.g., 
taken from Hive).

 

 

> Enhance Merge-Join to support Full Outer Join
> -
>
> Key: DRILL-7043
> URL: https://issues.apache.org/jira/browse/DRILL-7043
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators, Query Planning & 
> Optimization
>Affects Versions: 1.15.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Major
>
>    Currently the Merge Join operator internally cannot support a Right Outer 
> Join (and thus a Full Outer Join; for ROJ alone, the planner rotates the 
> inputs and specifies a Left Outer Join).
>    The actual reason for not supporting ROJ is the current MJ implementation 
> - when a match is found, it puts a mark on the right side and iterates down 
> on the right, resetting back at the end (and on to the next left side entry). 
>  This would create an ambiguity if the next left entry is bigger than the 
> previous - is this an unmatched (i.e., need to return the right entry), or 
> there was a prior match (i.e., just advance to the next right).
>    Seems that adding a relevant flag to the persisted state ({{status}}) and 
> some other code changes would make the operator support Right-Outer-Join as 
> well (and thus a Full Outer Join).  The planner needs an update as well - to 
> suggest the MJ in case of a FOJ, and maybe not to rotate the inputs in some 
> MJ cases.
>    Currently trying a FOJ with MJ (i.e. HJ disabled) produces the following 
> "no plan found" from Calcite:
> {noformat}
> 0: jdbc:drill:zk=local> select * from temp t1 full outer join temp2 t2 on 
> t1.d_date = t2.d_date;
> Error: SYSTEM ERROR: CannotPlanException: Node 
> [rel#2804:Subset#8.PHYSICAL.SINGLETON([]).[]] could not be implemented; 
> planner state:
> Root: rel#2804:Subset#8.PHYSICAL.SINGLETON([]).[]
> Original rel:
> DrillScreenRel(subset=[rel#2804:Subset#8.PHYSICAL.SINGLETON([]).[]]): 
> rowcount = 6.0, cumulative cost = {0.6001 rows, 
> 0.6001 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2802
>   DrillProjectRel(subset=[rel#2801:Subset#7.LOGICAL.ANY([]).[]], **=[$0], 
> **0=[$2]): rowcount = 6.0, cumulative cost = {6.0 rows, 12.0 cpu, 0.0 io, 0.0 
> network, 0.0 memory}, id = 2800
> DrillJoinRel(subset=[rel#2799:Subset#6.LOGICAL.ANY([]).[]], 
> condition=[=($1, $3)], joinType=[full]): rowcount = 6.0, cumulative cost = 
> {10.0 rows, 104.0 cpu, 0.0 io, 0.0 network, 70.4 memory}, id = 2798
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7043) Enhance Merge-Join to support Full Outer Join

2019-02-19 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772589#comment-16772589
 ] 

Boaz Ben-Zvi commented on DRILL-7043:
-

Possible relevant other Jiras: DRILL-1059 (No FOJ with HJ disabled; though it 
shows a different error) and DRILL-4811 (TPCDS 51 - FOJ not supported).

 

> Enhance Merge-Join to support Full Outer Join
> -
>
> Key: DRILL-7043
> URL: https://issues.apache.org/jira/browse/DRILL-7043
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators, Query Planning & 
> Optimization
>Affects Versions: 1.15.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Major
>
>    Currently the Merge Join operator internally cannot support a Right Outer 
> Join (and thus a Full Outer Join; for ROJ alone, the planner rotates the 
> inputs and specifies a Left Outer Join).
>    The actual reason for not supporting ROJ is the current MJ implementation 
> - when a match is found, it puts a mark on the right side and iterates down 
> on the right, resetting back at the end (and on to the next left side entry). 
>  This would create an ambiguity if the next left entry is bigger than the 
> previous - is this an unmatched (i.e., need to return the right entry), or 
> there was a prior match (i.e., just advance to the next right).
>    Seems that adding a relevant flag to the persisted state ({{status}}) and 
> some other code changes would make the operator support Right-Outer-Join as 
> well (and thus a Full Outer Join).  The planner needs an update as well - to 
> suggest the MJ in case of a FOJ, and maybe not to rotate the inputs in some 
> MJ cases.
>    Currently trying a FOJ with MJ (i.e. HJ disabled) produces the following 
> "no plan found" from Calcite:
> {noformat}
> 0: jdbc:drill:zk=local> select * from temp t1 full outer join temp2 t2 on 
> t1.d_date = t2.d_date;
> Error: SYSTEM ERROR: CannotPlanException: Node 
> [rel#2804:Subset#8.PHYSICAL.SINGLETON([]).[]] could not be implemented; 
> planner state:
> Root: rel#2804:Subset#8.PHYSICAL.SINGLETON([]).[]
> Original rel:
> DrillScreenRel(subset=[rel#2804:Subset#8.PHYSICAL.SINGLETON([]).[]]): 
> rowcount = 6.0, cumulative cost = {0.6001 rows, 
> 0.6001 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2802
>   DrillProjectRel(subset=[rel#2801:Subset#7.LOGICAL.ANY([]).[]], **=[$0], 
> **0=[$2]): rowcount = 6.0, cumulative cost = {6.0 rows, 12.0 cpu, 0.0 io, 0.0 
> network, 0.0 memory}, id = 2800
> DrillJoinRel(subset=[rel#2799:Subset#6.LOGICAL.ANY([]).[]], 
> condition=[=($1, $3)], joinType=[full]): rowcount = 6.0, cumulative cost = 
> {10.0 rows, 104.0 cpu, 0.0 io, 0.0 network, 70.4 memory}, id = 2798
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7043) Enhance Merge-Join to support Full Outer Join

2019-02-19 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-7043:
---

 Summary: Enhance Merge-Join to support Full Outer Join
 Key: DRILL-7043
 URL: https://issues.apache.org/jira/browse/DRILL-7043
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Relational Operators, Query Planning & 
Optimization
Affects Versions: 1.15.0
Reporter: Boaz Ben-Zvi
Assignee: Boaz Ben-Zvi


   Currently the Merge Join operator internally cannot support a Right Outer 
Join (and thus a Full Outer Join; for ROJ alone, the planner rotates the inputs 
and specifies a Left Outer Join).

   The actual reason for not supporting ROJ is the current MJ implementation - 
when a match is found, it puts a mark on the right side and iterates down on 
the right, resetting back at the end (and on to the next left side entry).  
This would create an ambiguity if the next left entry is bigger than the 
previous - is this an unmatched (i.e., need to return the right entry), or 
there was a prior match (i.e., just advance to the next right).

   Seems that adding a relevant flag to the persisted state ({{status}}) and 
some other code changes would make the operator support Right-Outer-Join as 
well (and thus a Full Outer Join).  The planner needs an update as well - to 
suggest the MJ in case of a FOJ, and maybe not to rotate the inputs in some MJ 
cases.
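For illustration only, here is a tiny self-contained merge-join sketch showing 
where the right-unmatched rows of a Full Outer Join would be emitted. It is 
simplified to distinct keys, so the mark/reset logic - and the proposed "right 
run matched" flag that resolves the ambiguity under duplicate keys - is elided:
{code:java}
// Hedged sketch of FOJ emission in a merge join; not the MJ operator's code.
import java.util.ArrayList;
import java.util.List;

public class FullOuterMergeSketch {
  public static List<String> join(int[] left, int[] right) {
    List<String> out = new ArrayList<>();
    int l = 0, r = 0;
    while (l < left.length && r < right.length) {
      if (left[l] == right[r]) {
        out.add(left[l] + "=" + right[r]);
        // With duplicate keys, mark/reset plus the proposed "right run
        // matched" flag would be needed here to disambiguate.
        l++; r++;
      } else if (left[l] < right[r]) {
        out.add(left[l] + "=null");  // left-unmatched (the LOJ half)
        l++;
      } else {
        out.add("null=" + right[r]); // right-unmatched (the ROJ half)
        r++;
      }
    }
    while (l < left.length)  out.add(left[l++] + "=null");
    while (r < right.length) out.add("null=" + right[r++]);
    return out;
  }

  public static void main(String[] args) {
    System.out.println(join(new int[]{1, 2, 4}, new int[]{2, 3, 4}));
    // [1=null, 2=2, null=3, 4=4]
  }
}
{code}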

   Currently trying a FOJ with MJ (i.e. HJ disabled) produces the following "no 
plan found" from Calcite:
{noformat}
0: jdbc:drill:zk=local> select * from temp t1 full outer join temp2 t2 on 
t1.d_date = t2.d_date;
Error: SYSTEM ERROR: CannotPlanException: Node 
[rel#2804:Subset#8.PHYSICAL.SINGLETON([]).[]] could not be implemented; planner 
state:

Root: rel#2804:Subset#8.PHYSICAL.SINGLETON([]).[]
Original rel:
DrillScreenRel(subset=[rel#2804:Subset#8.PHYSICAL.SINGLETON([]).[]]): rowcount 
= 6.0, cumulative cost = {0.6001 rows, 0.6001 cpu, 0.0 
io, 0.0 network, 0.0 memory}, id = 2802
  DrillProjectRel(subset=[rel#2801:Subset#7.LOGICAL.ANY([]).[]], **=[$0], 
**0=[$2]): rowcount = 6.0, cumulative cost = {6.0 rows, 12.0 cpu, 0.0 io, 0.0 
network, 0.0 memory}, id = 2800
DrillJoinRel(subset=[rel#2799:Subset#6.LOGICAL.ANY([]).[]], 
condition=[=($1, $3)], joinType=[full]): rowcount = 6.0, cumulative cost = 
{10.0 rows, 104.0 cpu, 0.0 io, 0.0 network, 70.4 memory}, id = 2798

{noformat}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (DRILL-2010) merge join returns wrong number of rows with large dataset

2019-02-19 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi closed DRILL-2010.
---
   Resolution: Cannot Reproduce
Fix Version/s: 1.15.0

Retested the same test - works for me. Looks like later changes in 
{{RecordIterator}} made the code work correctly across multiple (right-side) 
batches.


> merge join returns wrong number of rows with large dataset
> --
>
> Key: DRILL-2010
> URL: https://issues.apache.org/jira/browse/DRILL-2010
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 0.8.0
>Reporter: Chun Chang
>Assignee: Boaz Ben-Zvi
>Priority: Critical
> Fix For: 1.15.0
>
> Attachments: DRILL-2010-1.patch, DRILL-2010-1.patch
>
>
> #Mon Jan 12 18:19:31 EST 2015
> git.commit.id.abbrev=5b012bf
> When the data set is big enough (e.g., larger than one batch size), merge 
> join does not return the correct number of rows. Hash join returns the 
> correct number of rows. The data can be downloaded from:
> https://s3.amazonaws.com/apache-drill/files/complex100k.json.gz
> With this dataset, the following query should return 10,000,000. 
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set 
> `planner.enable_mergejoin` = true;
> +++
> | ok |  summary   |
> +++
> | true   | planner.enable_mergejoin updated. |
> +++
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set 
> `planner.enable_hashjoin` = false;
> +++
> | ok |  summary   |
> +++
> | true   | planner.enable_hashjoin updated. |
> +++
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select count(a.id) from 
> `complex100k.json` a inner join `complex100k.json` b on a.gbyi=b.gbyi;
> ++
> |   EXPR$0   |
> ++
> | 9046760|
> ++
> 1 row selected (6.205 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set 
> `planner.enable_mergejoin` = false;
> +++
> | ok |  summary   |
> +++
> | true   | planner.enable_mergejoin updated. |
> +++
> 1 row selected (0.026 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set 
> `planner.enable_hashjoin` = true;
> +++
> | ok |  summary   |
> +++
> | true   | planner.enable_hashjoin updated. |
> +++
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select count(a.id) from 
> `complex100k.json` a inner join `complex100k.json` b on a.gbyi=b.gbyi;
> ++
> |   EXPR$0   |
> ++
> | 1000   |
> ++
> 1 row selected (4.453 seconds)
> {code}
> With a smaller dataset, both merge and hash join return the same correct 
> number.
> physical plan for merge join:
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select 
> count(a.id) from `complex100k.json` a inner join `complex100k.json` b on 
> a.gbyi=b.gbyi;
> +++
> |text|json|
> +++
> | 00-00Screen
> 00-01  StreamAgg(group=[{}], EXPR$0=[COUNT($0)])
> 00-02Project(id=[$1])
> 00-03  MergeJoin(condition=[=($0, $2)], joinType=[inner])
> 00-05SelectionVectorRemover
> 00-07  Sort(sort0=[$0], dir0=[ASC])
> 00-09Scan(groupscan=[EasyGroupScan 
> [selectionRoot=/drill/testdata/complex_type/json/complex100k.json, 
> numFiles=1, columns=[`gbyi`, `id`], 
> files=[maprfs:/drill/testdata/complex_type/json/complex100k.json]]])
> 00-04Project(gbyi0=[$0])
> 00-06  SelectionVectorRemover
> 00-08Sort(sort0=[$0], dir0=[ASC])
> 00-10  Scan(groupscan=[EasyGroupScan 
> [selectionRoot=/drill/testdata/complex_type/json/complex100k.json, 
> numFiles=1, columns=[`gbyi`], 
> files=[maprfs:/drill/testdata/complex_type/json/complex100k.json]]])
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (DRILL-2010) merge join returns wrong number of rows with large dataset

2019-02-19 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi reassigned DRILL-2010:
---

Assignee: Boaz Ben-Zvi

> merge join returns wrong number of rows with large dataset
> --
>
> Key: DRILL-2010
> URL: https://issues.apache.org/jira/browse/DRILL-2010
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 0.8.0
>Reporter: Chun Chang
>Assignee: Boaz Ben-Zvi
>Priority: Critical
> Attachments: DRILL-2010-1.patch, DRILL-2010-1.patch
>
>
> #Mon Jan 12 18:19:31 EST 2015
> git.commit.id.abbrev=5b012bf
> When the data set is big enough (e.g., larger than one batch size), merge 
> join does not return the correct number of rows; hash join returns the 
> correct number. Data can be downloaded from:
> https://s3.amazonaws.com/apache-drill/files/complex100k.json.gz
> With this dataset, the following query should return 10,000,000. 
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set 
> `planner.enable_mergejoin` = true;
> +-------+------------------------------------+
> |  ok   |              summary               |
> +-------+------------------------------------+
> | true  | planner.enable_mergejoin updated.  |
> +-------+------------------------------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set 
> `planner.enable_hashjoin` = false;
> +-------+------------------------------------+
> |  ok   |              summary               |
> +-------+------------------------------------+
> | true  | planner.enable_hashjoin updated.   |
> +-------+------------------------------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select count(a.id) from 
> `complex100k.json` a inner join `complex100k.json` b on a.gbyi=b.gbyi;
> +------------+
> |   EXPR$0   |
> +------------+
> | 9046760    |
> +------------+
> 1 row selected (6.205 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set 
> `planner.enable_mergejoin` = false;
> +-------+------------------------------------+
> |  ok   |              summary               |
> +-------+------------------------------------+
> | true  | planner.enable_mergejoin updated.  |
> +-------+------------------------------------+
> 1 row selected (0.026 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set 
> `planner.enable_hashjoin` = true;
> +-------+------------------------------------+
> |  ok   |              summary               |
> +-------+------------------------------------+
> | true  | planner.enable_hashjoin updated.   |
> +-------+------------------------------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select count(a.id) from 
> `complex100k.json` a inner join `complex100k.json` b on a.gbyi=b.gbyi;
> +------------+
> |   EXPR$0   |
> +------------+
> | 10000000   |
> +------------+
> 1 row selected (4.453 seconds)
> {code}
> With a smaller dataset, both merge and hash join return the same correct 
> number.
> Physical plan for the merge join:
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select 
> count(a.id) from `complex100k.json` a inner join `complex100k.json` b on 
> a.gbyi=b.gbyi;
> +------+------+
> | text | json |
> +------+------+
> | 00-00Screen
> 00-01  StreamAgg(group=[{}], EXPR$0=[COUNT($0)])
> 00-02Project(id=[$1])
> 00-03  MergeJoin(condition=[=($0, $2)], joinType=[inner])
> 00-05SelectionVectorRemover
> 00-07  Sort(sort0=[$0], dir0=[ASC])
> 00-09Scan(groupscan=[EasyGroupScan 
> [selectionRoot=/drill/testdata/complex_type/json/complex100k.json, 
> numFiles=1, columns=[`gbyi`, `id`], 
> files=[maprfs:/drill/testdata/complex_type/json/complex100k.json]]])
> 00-04Project(gbyi0=[$0])
> 00-06  SelectionVectorRemover
> 00-08Sort(sort0=[$0], dir0=[ASC])
> 00-10  Scan(groupscan=[EasyGroupScan 
> [selectionRoot=/drill/testdata/complex_type/json/complex100k.json, 
> numFiles=1, columns=[`gbyi`], 
> files=[maprfs:/drill/testdata/complex_type/json/complex100k.json]]])
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (DRILL-6914) Query with RuntimeFilter and SemiJoin fails with IllegalStateException: Memory was leaked by query

2019-02-08 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi resolved DRILL-6914.
-
Resolution: Fixed

The interaction between the Hash-Join spill and the runtime filter was fixed in 
PR #1622. Testing with the latest code shows no memory leaks.

 

> Query with RuntimeFilter and SemiJoin fails with IllegalStateException: 
> Memory was leaked by query
> --
>
> Key: DRILL-6914
> URL: https://issues.apache.org/jira/browse/DRILL-6914
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.15.0
>Reporter: Abhishek Ravi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.16.0
>
> Attachments: 23cc1af3-0e8e-b2c9-a889-a96504988d6c.sys.drill, 
> 23cc1b7c-5b5c-d123-5e72-6d7d2719df39.sys.drill
>
>
> Following query fails on TPC-H SF 100 dataset when 
> exec.hashjoin.enable.runtime_filter = true AND planner.enable_semijoin = true.
> Note that the query does not fail if any one of them or both are disabled.
> {code:sql}
> set `exec.hashjoin.enable.runtime_filter` = true;
> set `exec.hashjoin.runtime_filter.max.waiting.time` = 1;
> set `planner.enable_broadcast_join` = false;
> set `planner.enable_semijoin` = true;
> select
>  count(*) as row_count
> from
>  lineitem l1
> where
>  l1.l_shipdate IN (
>  select
>  distinct(cast(l2.l_shipdate as date))
>  from
>  lineitem l2);
> reset `exec.hashjoin.enable.runtime_filter`;
> reset `exec.hashjoin.runtime_filter.max.waiting.time`;
> reset `planner.enable_broadcast_join`;
> reset `planner.enable_semijoin`;
> {code}
>  
> {noformat}
> Error: SYSTEM ERROR: IllegalStateException: Memory was leaked by query. 
> Memory leaked: (134217728)
> Allocator(frag:1:0) 800/134217728/172453568/70126322567 
> (res/actual/peak/limit)
> Fragment 1:0
> Please, refer to logs for more information.
> [Error Id: ccee18b3-c3ff-4fdb-b314-23a6cfed0a0e on qa-node185.qa.lab:31010] 
> (state=,code=0)
> java.sql.SQLException: SYSTEM ERROR: IllegalStateException: Memory was leaked 
> by query. Memory leaked: (134217728)
> Allocator(frag:1:0) 800/134217728/172453568/70126322567 
> (res/actual/peak/limit)
> Fragment 1:0
> Please, refer to logs for more information.
> [Error Id: ccee18b3-c3ff-4fdb-b314-23a6cfed0a0e on qa-node185.qa.lab:31010]
> at 
> org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(DrillCursor.java:536)
> at org.apache.drill.jdbc.impl.DrillCursor.next(DrillCursor.java:640)
> at org.apache.calcite.avatica.AvaticaResultSet.next(AvaticaResultSet.java:217)
> at 
> org.apache.drill.jdbc.impl.DrillResultSetImpl.next(DrillResultSetImpl.java:151)
> at sqlline.BufferedRows.<init>(BufferedRows.java:37)
> at sqlline.SqlLine.print(SqlLine.java:1716)
> at sqlline.Commands.execute(Commands.java:949)
> at sqlline.Commands.sql(Commands.java:882)
> at sqlline.SqlLine.dispatch(SqlLine.java:725)
> at sqlline.SqlLine.runCommands(SqlLine.java:1779)
> at sqlline.Commands.run(Commands.java:1485)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:38)
> at sqlline.SqlLine.dispatch(SqlLine.java:722)
> at sqlline.SqlLine.initArgs(SqlLine.java:458)
> at sqlline.SqlLine.begin(SqlLine.java:514)
> at sqlline.SqlLine.start(SqlLine.java:264)
> at sqlline.SqlLine.main(SqlLine.java:195)
> Caused by: org.apache.drill.common.exceptions.UserRemoteException: SYSTEM 
> ERROR: IllegalStateException: Memory was leaked by query. Memory leaked: 
> (134217728)
> Allocator(frag:1:0) 800/134217728/172453568/70126322567 
> (res/actual/peak/limit)
> Fragment 1:0
> Please, refer to logs for more information.
> [Error Id: ccee18b3-c3ff-4fdb-b314-23a6cfed0a0e on qa-node185.qa.lab:31010]
> at 
> org.apache.drill.exec.rpc.user.QueryResultHandler.resultArrived(QueryResultHandler.java:123)
> at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:422)
> at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:96)
> at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:273)
> at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:243)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
> at 
> 

[jira] [Updated] (DRILL-7034) Window function over a malformed CSV file crashes the JVM

2019-02-08 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-7034:

Attachment: janino8470007454663483217.java

> Window function over a malformed CSV file crashes the JVM 
> --
>
> Key: DRILL-7034
> URL: https://issues.apache.org/jira/browse/DRILL-7034
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.15.0
>Reporter: Boaz Ben-Zvi
>Priority: Major
> Attachments: hs_err_pid23450.log, janino8470007454663483217.java
>
>
> The JVM crashes executing window functions over (an ordered) CSV file with a 
> small format issue - an empty line.
> To create: Take the following simple `a.csvh` file:
> {noformat}
> amount
> 10
> 11
> {noformat}
> And execute a simple window function like
> {code:sql}
> select max(amount) over(order by amount) FROM dfs.`/data/a.csvh`;
> {code}
> Then add an empty line between the `10` and the `11`:
> {noformat}
> amount
> 10
> 
> 11
> {noformat}
>  and try again:
> {noformat}
> 0: jdbc:drill:zk=local> select max(amount) over(order by amount) FROM 
> dfs.`/data/a.csvh`;
> +-+
> | EXPR$0  |
> +-+
> | 10  |
> | 11  |
> +-+
> 2 rows selected (3.554 seconds)
> 0: jdbc:drill:zk=local> select max(amount) over(order by amount) FROM 
> dfs.`/data/a.csvh`;
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x0001064aeae7, pid=23450, tid=0x6103
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 
> 1.8.0_181-b13)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode bsd-amd64 
> compressed oops)
> # Problematic frame:
> # J 6719% C2 
> org.apache.drill.exec.expr.fn.impl.ByteFunctionHelpers.memcmp(JIIJII)I (188 
> bytes) @ 0x0001064aeae7 [0x0001064ae920+0x1c7]
> #
> # Core dump written. Default location: /cores/core or core.23450
> #
> # An error report file with more information is saved as:
> # /Users/boazben-zvi/IdeaProjects/drill/hs_err_pid23450.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> Abort trap: 6 (core dumped)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7034) Window function over a malformed CSV file crashes the JVM

2019-02-08 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763977#comment-16763977
 ] 

Boaz Ben-Zvi commented on DRILL-7034:
-

Attached the generated code for the window function; note the `isPeer()` method 
there.
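
For reference, a rough sketch (illustrative only, not the attached generated 
source) of the shape of the generated `isPeer()`: two rows are peers iff the 
bytes of their ORDER BY column compare equal. The buffer/offset parameters are 
assumptions here; in the generated code they come from the value vector 
backing the ordering column.
{code:java}
import io.netty.buffer.DrillBuf;
import org.apache.drill.exec.expr.fn.impl.ByteFunctionHelpers;

public class IsPeerSketch {
  // compare() returns 0 when the two byte ranges are equal. A bad row index
  // upstream produces out-of-range offsets here, and the unsafe memcmp
  // underneath crashes the JVM instead of throwing a Java exception.
  static boolean isPeer(DrillBuf leftBuf, int lStart, int lEnd,
                        DrillBuf rightBuf, int rStart, int rEnd) {
    return ByteFunctionHelpers.compare(leftBuf, lStart, lEnd,
                                       rightBuf, rStart, rEnd) == 0;
  }
}
{code}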

 

> Window function over a malformed CSV file crashes the JVM 
> --
>
> Key: DRILL-7034
> URL: https://issues.apache.org/jira/browse/DRILL-7034
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.15.0
>Reporter: Boaz Ben-Zvi
>Priority: Major
> Attachments: hs_err_pid23450.log, janino8470007454663483217.java
>
>
> The JVM crashes executing window functions over (an ordered) CSV file with a 
> small format issue - an empty line.
> To create: Take the following simple `a.csvh` file:
> {noformat}
> amount
> 10
> 11
> {noformat}
> And execute a simple window function like
> {code:sql}
> select max(amount) over(order by amount) FROM dfs.`/data/a.csvh`;
> {code}
> Then add an empty line between the `10` and the `11`:
> {noformat}
> amount
> 10
> 
> 11
> {noformat}
>  and try again:
> {noformat}
> 0: jdbc:drill:zk=local> select max(amount) over(order by amount) FROM 
> dfs.`/data/a.csvh`;
> +-+
> | EXPR$0  |
> +-+
> | 10  |
> | 11  |
> +-+
> 2 rows selected (3.554 seconds)
> 0: jdbc:drill:zk=local> select max(amount) over(order by amount) FROM 
> dfs.`/data/a.csvh`;
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x0001064aeae7, pid=23450, tid=0x6103
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 
> 1.8.0_181-b13)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode bsd-amd64 
> compressed oops)
> # Problematic frame:
> # J 6719% C2 
> org.apache.drill.exec.expr.fn.impl.ByteFunctionHelpers.memcmp(JIIJII)I (188 
> bytes) @ 0x0001064aeae7 [0x0001064ae920+0x1c7]
> #
> # Core dump written. Default location: /cores/core or core.23450
> #
> # An error report file with more information is saved as:
> # /Users/boazben-zvi/IdeaProjects/drill/hs_err_pid23450.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> Abort trap: 6 (core dumped)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7034) Window function over a malformed CSV file crashes the JVM

2019-02-08 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763973#comment-16763973
 ] 

Boaz Ben-Zvi commented on DRILL-7034:
-

Maybe related to DRILL-4845: "Malformed CSV throws IllegalArgumentException".

 

> Window function over a malformed CSV file crashes the JVM 
> --
>
> Key: DRILL-7034
> URL: https://issues.apache.org/jira/browse/DRILL-7034
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.15.0
>Reporter: Boaz Ben-Zvi
>Priority: Major
> Attachments: hs_err_pid23450.log
>
>
> The JVM crashes executing window functions over (an ordered) CSV file with a 
> small format issue - an empty line.
> To create: Take the following simple `a.csvh` file:
> {noformat}
> amount
> 10
> 11
> {noformat}
> And execute a simple window function like
> {code:sql}
> select max(amount) over(order by amount) FROM dfs.`/data/a.csvh`;
> {code}
> Then add an empty line between the `10` and the `11`:
> {noformat}
> amount
> 10
> 
> 11
> {noformat}
>  and try again:
> {noformat}
> 0: jdbc:drill:zk=local> select max(amount) over(order by amount) FROM 
> dfs.`/data/a.csvh`;
> +-+
> | EXPR$0  |
> +-+
> | 10  |
> | 11  |
> +-+
> 2 rows selected (3.554 seconds)
> 0: jdbc:drill:zk=local> select max(amount) over(order by amount) FROM 
> dfs.`/data/a.csvh`;
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x0001064aeae7, pid=23450, tid=0x6103
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 
> 1.8.0_181-b13)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode bsd-amd64 
> compressed oops)
> # Problematic frame:
> # J 6719% C2 
> org.apache.drill.exec.expr.fn.impl.ByteFunctionHelpers.memcmp(JIIJII)I (188 
> bytes) @ 0x0001064aeae7 [0x0001064ae920+0x1c7]
> #
> # Core dump written. Default location: /cores/core or core.23450
> #
> # An error report file with more information is saved as:
> # /Users/boazben-zvi/IdeaProjects/drill/hs_err_pid23450.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> Abort trap: 6 (core dumped)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7034) Window function over a malformed CSV file crashes the JVM

2019-02-08 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763972#comment-16763972
 ] 

Boaz Ben-Zvi commented on DRILL-7034:
-

Attached the error log. It looks like the window function's generated code 
passed bad pointers to memcmp:
{noformat}
Stack: [0x7ea91000,0x7eb91000],  sp=0x7eb8efe0,  free 
space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
J 6719% C2 
org.apache.drill.exec.expr.fn.impl.ByteFunctionHelpers.memcmp(JIIJII)I (188 
bytes) @ 0x0001064aeae7 [0x0001064ae920+0x1c7]
j 
org.apache.drill.exec.expr.fn.impl.ByteFunctionHelpers.compare(Lio/netty/buffer/DrillBuf;IILio/netty/buffer/DrillBuf;II)I+25
j 
org.apache.drill.exec.test.generated.WindowFramerGen0.isPeer(ILorg/apache/drill/exec/record/VectorAccessible;ILorg/apache/drill/exec/record/VectorAccessible;)Z+305
j 
org.apache.drill.exec.physical.impl.window.FrameSupportTemplate.aggregatePeers(I)J+133
{noformat}
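
A minimal defensive-check sketch (illustrative; this is not Drill's actual 
code): validating the row index against the vector's value count before 
handing raw addresses to the unsafe memcmp would turn the SIGSEGV into a 
regular Java exception that names the offending row.
{code:java}
public class BoundsCheckSketch {
  // Hypothetical guard to run before computing buffer offsets for memcmp.
  static void checkRowIndex(int rowIndex, int valueCount) {
    if (rowIndex < 0 || rowIndex >= valueCount) {
      throw new IndexOutOfBoundsException(
          "row index " + rowIndex + " outside [0, " + valueCount + ")");
    }
  }
}
{code}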


> Window function over a malformed CSV file crashes the JVM 
> --
>
> Key: DRILL-7034
> URL: https://issues.apache.org/jira/browse/DRILL-7034
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.15.0
>Reporter: Boaz Ben-Zvi
>Priority: Major
> Attachments: hs_err_pid23450.log
>
>
> The JVM crashes executing window functions over (an ordered) CSV file with a 
> small format issue - an empty line.
> To create: Take the following simple `a.csvh` file:
> {noformat}
> amount
> 10
> 11
> {noformat}
> And execute a simple window function like
> {code:sql}
> select max(amount) over(order by amount) FROM dfs.`/data/a.csvh`;
> {code}
> Then add an empty line between the `10` and the `11`:
> {noformat}
> amount
> 10
> 
> 11
> {noformat}
>  and try again:
> {noformat}
> 0: jdbc:drill:zk=local> select max(amount) over(order by amount) FROM 
> dfs.`/data/a.csvh`;
> +-+
> | EXPR$0  |
> +-+
> | 10  |
> | 11  |
> +-+
> 2 rows selected (3.554 seconds)
> 0: jdbc:drill:zk=local> select max(amount) over(order by amount) FROM 
> dfs.`/data/a.csvh`;
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x0001064aeae7, pid=23450, tid=0x6103
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 
> 1.8.0_181-b13)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode bsd-amd64 
> compressed oops)
> # Problematic frame:
> # J 6719% C2 
> org.apache.drill.exec.expr.fn.impl.ByteFunctionHelpers.memcmp(JIIJII)I (188 
> bytes) @ 0x0001064aeae7 [0x0001064ae920+0x1c7]
> #
> # Core dump written. Default location: /cores/core or core.23450
> #
> # An error report file with more information is saved as:
> # /Users/boazben-zvi/IdeaProjects/drill/hs_err_pid23450.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> Abort trap: 6 (core dumped)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7034) Window function over a malformed CSV file crashes the JVM

2019-02-08 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-7034:

Attachment: hs_err_pid23450.log

> Window function over a malformed CSV file crashes the JVM 
> --
>
> Key: DRILL-7034
> URL: https://issues.apache.org/jira/browse/DRILL-7034
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.15.0
>Reporter: Boaz Ben-Zvi
>Priority: Major
> Attachments: hs_err_pid23450.log
>
>
> The JVM crashes executing window functions over (an ordered) CSV file with a 
> small format issue - an empty line.
> To create: Take the following simple `a.csvh` file:
> {noformat}
> amount
> 10
> 11
> {noformat}
> And execute a simple window function like
> {code:sql}
> select max(amount) over(order by amount) FROM dfs.`/data/a.csvh`;
> {code}
> Then add an empty line between the `10` and the `11`:
> {noformat}
> amount
> 10
> 
> 11
> {noformat}
>  and try again:
> {noformat}
> 0: jdbc:drill:zk=local> select max(amount) over(order by amount) FROM 
> dfs.`/data/a.csvh`;
> +-+
> | EXPR$0  |
> +-+
> | 10  |
> | 11  |
> +-+
> 2 rows selected (3.554 seconds)
> 0: jdbc:drill:zk=local> select max(amount) over(order by amount) FROM 
> dfs.`/data/a.csvh`;
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x0001064aeae7, pid=23450, tid=0x6103
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 
> 1.8.0_181-b13)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode bsd-amd64 
> compressed oops)
> # Problematic frame:
> # J 6719% C2 
> org.apache.drill.exec.expr.fn.impl.ByteFunctionHelpers.memcmp(JIIJII)I (188 
> bytes) @ 0x0001064aeae7 [0x0001064ae920+0x1c7]
> #
> # Core dump written. Default location: /cores/core or core.23450
> #
> # An error report file with more information is saved as:
> # /Users/boazben-zvi/IdeaProjects/drill/hs_err_pid23450.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> Abort trap: 6 (core dumped)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7034) Window function over a malformed CSV file crashes the JVM

2019-02-08 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-7034:
---

 Summary: Window function over a malformed CSV file crashes the JVM 
 Key: DRILL-7034
 URL: https://issues.apache.org/jira/browse/DRILL-7034
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Relational Operators
Affects Versions: 1.15.0
Reporter: Boaz Ben-Zvi


The JVM crashes executing window functions over (an ordered) CSV file with a 
small format issue - an empty line.

To create: Take the following simple `a.csvh` file:
{noformat}
amount
10
11
{noformat}

And execute a simple window function like
{code:sql}
select max(amount) over(order by amount) FROM dfs.`/data/a.csvh`;
{code}

Then add an empty line between the `10` and the `11`:
{noformat}
amount
10

11
{noformat}

 and try again:
{noformat}
0: jdbc:drill:zk=local> select max(amount) over(order by amount) FROM 
dfs.`/data/a.csvh`;
+-+
| EXPR$0  |
+-+
| 10  |
| 11  |
+-+
2 rows selected (3.554 seconds)
0: jdbc:drill:zk=local> select max(amount) over(order by amount) FROM 
dfs.`/data/a.csvh`;
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0001064aeae7, pid=23450, tid=0x6103
#
# JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 
1.8.0_181-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode bsd-amd64 
compressed oops)
# Problematic frame:
# J 6719% C2 
org.apache.drill.exec.expr.fn.impl.ByteFunctionHelpers.memcmp(JIIJII)I (188 
bytes) @ 0x0001064aeae7 [0x0001064ae920+0x1c7]
#
# Core dump written. Default location: /cores/core or core.23450
#
# An error report file with more information is saved as:
# /Users/boazben-zvi/IdeaProjects/drill/hs_err_pid23450.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
Abort trap: 6 (core dumped)
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (DRILL-7023) Query fails with IndexOutOfBoundsException after upgrade from drill 1.13.0 to drill 1.14.0

2019-02-01 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi closed DRILL-7023.
---
   Resolution: Fixed
 Assignee: Boaz Ben-Zvi
Fix Version/s: 1.15.0

Fixed in PR#1344.

 

> Query fails with IndexOutOfBoundsException after upgrade from drill 1.13.0 to 
> drill 1.14.0
> --
>
> Key: DRILL-7023
> URL: https://issues.apache.org/jira/browse/DRILL-7023
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.14.0
>Reporter: Khurram Faraaz
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.15.0
>
>
> Query fails with IndexOutOfBoundsException after upgrade from drill 1.13.0 to 
> drill 1.14.0
> {noformat}
> 2018-12-06 21:43:00,538 [23f5f79c-3777-eb37-ee46-f73be74381ef:frag:2:1] ERROR 
> o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: IndexOutOfBoundsException
> Fragment 2:1
> [Error Id: 3b653503-b6da-4853-a395-317a169468ce on am1397.test.net:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> IndexOutOfBoundsException
> Fragment 2:1
> [Error Id: 3b653503-b6da-4853-a395-317a169468ce on am1397.test.net:31010]
>  at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:633)
>  ~[drill-common-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:361)
>  [drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:216)
>  [drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:327)
>  [drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [na:1.8.0_152]
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [na:1.8.0_152]
>  at java.lang.Thread.run(Thread.java:748) [na:1.8.0_152]
> Caused by: java.lang.IndexOutOfBoundsException: null
>  at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:677) 
> ~[drill-memory-base-1.14.0-mapr.jar:4.0.48.Final]
>  at 
> org.apache.drill.exec.vector.BigIntVector.copyEntry(BigIntVector.java:389) 
> ~[vector-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.test.generated.HashJoinProbeGen480.appendProbe(HashJoinProbeTemplate.java:190)
>  ~[na:na]
>  at 
> org.apache.drill.exec.test.generated.HashJoinProbeGen480.outputOuterRow(HashJoinProbeTemplate.java:223)
>  ~[na:na]
>  at 
> org.apache.drill.exec.test.generated.HashJoinProbeGen480.executeProbePhase(HashJoinProbeTemplate.java:357)
>  ~[na:na]
>  at 
> org.apache.drill.exec.test.generated.HashJoinProbeGen480.probeAndProject(HashJoinProbeTemplate.java:400)
>  ~[na:na]
>  at 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.innerNext(HashJoinBatch.java:465)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:172)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:63)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:172)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:69)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:172)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> 

[jira] [Commented] (DRILL-7023) Query fails with IndexOutOfBoundsException after upgrade from drill 1.13.0 to drill 1.14.0

2019-02-01 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758830#comment-16758830
 ] 

Boaz Ben-Zvi commented on DRILL-7023:
-

Apache 1.15.0 should not see this failure, as it was fixed by DRILL-6461 
(PR#1344, committed on Aug 27, 2018). That PR changed the implementation of 
{{copyEntry()}} for {{BigIntVector}} (second frame from the top of the given 
stack trace) to verify the size of the target vector, and to reallocate the 
vector when it is too small for the index being inserted into.
  The root cause of the failure was the Hash-Join resizing its outgoing batch 
upward: the size variable was increased, but the already-allocated vectors 
were not reallocated, and so remained at the old, smaller size.
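
A sketch of that guard (illustrative; the real method lives in Drill's 
generated fixed-width vector classes, and `reAlloc()`, `getValueCapacity()` 
and `copyFrom()` are assumed from the vector template):
{code:java}
// Before writing at toIndex, grow the backing buffer until it can hold the
// entry; reAlloc() doubles the capacity while preserving the existing data.
public void copyEntry(int toIndex, ValueVector from, int fromIndex) {
  while (toIndex >= getValueCapacity()) {
    reAlloc();
  }
  copyFrom(fromIndex, toIndex, (BigIntVector) from);
}
{code}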



> Query fails with IndexOutOfBoundsException after upgrade from drill 1.13.0 to 
> drill 1.14.0
> --
>
> Key: DRILL-7023
> URL: https://issues.apache.org/jira/browse/DRILL-7023
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.14.0
>Reporter: Khurram Faraaz
>Priority: Major
>
> Query fails with IndexOutOfBoundsException after upgrade from drill 1.13.0 to 
> drill 1.14.0
> {noformat}
> 2018-12-06 21:43:00,538 [23f5f79c-3777-eb37-ee46-f73be74381ef:frag:2:1] ERROR 
> o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: IndexOutOfBoundsException
> Fragment 2:1
> [Error Id: 3b653503-b6da-4853-a395-317a169468ce on am1397.test.net:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> IndexOutOfBoundsException
> Fragment 2:1
> [Error Id: 3b653503-b6da-4853-a395-317a169468ce on am1397.test.net:31010]
>  at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:633)
>  ~[drill-common-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:361)
>  [drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:216)
>  [drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:327)
>  [drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [na:1.8.0_152]
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [na:1.8.0_152]
>  at java.lang.Thread.run(Thread.java:748) [na:1.8.0_152]
> Caused by: java.lang.IndexOutOfBoundsException: null
>  at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:677) 
> ~[drill-memory-base-1.14.0-mapr.jar:4.0.48.Final]
>  at 
> org.apache.drill.exec.vector.BigIntVector.copyEntry(BigIntVector.java:389) 
> ~[vector-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.test.generated.HashJoinProbeGen480.appendProbe(HashJoinProbeTemplate.java:190)
>  ~[na:na]
>  at 
> org.apache.drill.exec.test.generated.HashJoinProbeGen480.outputOuterRow(HashJoinProbeTemplate.java:223)
>  ~[na:na]
>  at 
> org.apache.drill.exec.test.generated.HashJoinProbeGen480.executeProbePhase(HashJoinProbeTemplate.java:357)
>  ~[na:na]
>  at 
> org.apache.drill.exec.test.generated.HashJoinProbeGen480.probeAndProject(HashJoinProbeTemplate.java:400)
>  ~[na:na]
>  at 
> org.apache.drill.exec.physical.impl.join.HashJoinBatch.innerNext(HashJoinBatch.java:465)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:172)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:63)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:172)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>  ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at 
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:69)

[jira] [Created] (DRILL-7015) Improve documentation for PARTITION BY

2019-01-29 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-7015:
---

 Summary: Improve documentation for PARTITION BY
 Key: DRILL-7015
 URL: https://issues.apache.org/jira/browse/DRILL-7015
 Project: Apache Drill
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.15.0
Reporter: Boaz Ben-Zvi
Assignee: Bridget Bevens
 Fix For: 1.16.0


The documentation for CREATE TABLE AS (CTAS) shows the syntax of the command 
without the optional PARTITION BY clause; that option is only mentioned later, 
under the usage notes.

*+_Suggestion_+*: Add this optional clause to the syntax (same as for CREATE 
TEMPORARY TABLE (CTTAS)), and mention that this option is only applicable when 
storing in Parquet. 

In the documentation for CREATE TEMPORARY TABLE (CTTAS), the comment says:
{panel}
An optional parameter that can *only* be used to create temporary tables with 
the Parquet data format. 
{panel}
This can mistakenly be understood as "only for temporary tables". 
*_+Suggestion+_*: erase the "to create temporary tables" part (it is not 
needed, as it is implied by the context of this page).

*_+Last suggestion+_*: In the documentation for the PARTITION BY clause, add 
an example using the implicit column "filename" to demonstrate how the 
partitioning column puts each distinct value into a separate file. For 
example, add in the "Other Examples" section:
{noformat}
0: jdbc:drill:zk=local> select distinct r_regionkey, filename from mytable1;
+--------------+----------------+
| r_regionkey  |    filename    |
+--------------+----------------+
| 2            | 0_0_3.parquet  |
| 1            | 0_0_2.parquet  |
| 0            | 0_0_1.parquet  |
| 3            | 0_0_4.parquet  |
| 4            | 0_0_5.parquet  |
+--------------+----------------+
{noformat}
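
For context, a CTAS of roughly this shape (a sketch; the table name and target 
workspace are illustrative, using the bundled TPC-H region sample) would 
produce such a per-value file layout:
{code:sql}
-- One Parquet file is written per distinct r_regionkey value.
create table dfs.tmp.mytable1 partition by (r_regionkey) as
select r_regionkey, r_name from cp.`tpch/region.parquet`;
{code}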



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7013) Hash-Join and Hash-Aggr to handle incoming with selection vectors

2019-01-28 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-7013:

Description: 
  The Hash-Join and Hash-Aggr operators copy each incoming row separately. When 
the incoming data has a selection vector (e.g., outgoing from a Filter), a 
_SelectionVectorRemover_ is added before the Hash operator, as the latter 
cannot handle the selection vector.  

  Thus every row is needlessly being copied twice!

+Suggestion+: Enhance the Hash operators to handle potential incoming selection 
vectors, thus eliminating  the need for the extra copy. The planner needs to be 
changed not to add that SelectionVectorRemover.
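
As a sketch of what "handling" the selection vector would mean (assumptions: 
the SelectionVector2 getCount()/getIndex() API and a BigIntVector key column; 
the helper below is hypothetical):
{code:java}
import org.apache.drill.exec.record.selection.SelectionVector2;
import org.apache.drill.exec.vector.BigIntVector;

public class Sv2ProbeSketch {
  // Read the key column through the SV2 indirection instead of requiring a
  // SelectionVectorRemover to compact the batch first.
  static void probe(SelectionVector2 sv2, BigIntVector keys) {
    for (int i = 0; i < sv2.getCount(); i++) {
      int row = sv2.getIndex(i);            // dense position -> real row index
      long key = keys.getAccessor().get(row);
      // ... hash 'key' and build/probe as usual, with no intermediate copy ...
    }
  }
}
{code}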

(Two comments:
* Note the special case of Hash-Join with num_partitions = 1, where the build 
side vectors are used as is, not copied.
* Conflicts with the suggestion not to copy probe vectors, in DRILL-5912 )
 

For example:
{code:sql}
select * from cp.`tpch/lineitem.parquet` L,  cp.`tpch/orders.parquet` O where 
O.o_custkey > 1498 and L.l_orderkey > 58999 and O.o_orderkey = L.l_orderkey 
{code}
And the plan:
{panel}
00-00 Screen : rowType = RecordType(DYNAMIC_STAR **, DYNAMIC_STAR **0): 
 00-01 ProjectAllowDup(**=[$0], **0=[$1]) : rowType = RecordType(DYNAMIC_STAR 
**, DYNAMIC_STAR **0): 
 00-02 Project(T44¦¦**=[$0], T45¦¦**=[$2]) : rowType = RecordType(DYNAMIC_STAR 
T44¦¦**, DYNAMIC_STAR T45¦¦**): 
 00-03 HashJoin(condition=[=($1, $4)], joinType=[inner], semi-join: =[false]) : 
rowType = RecordType(DYNAMIC_STAR T44¦¦**, ANY l_orderkey, DYNAMIC_STAR 
T45¦¦**, ANY o_custkey, ANY o_orderkey): 
 00-05   *SelectionVectorRemover* : rowType = RecordType(DYNAMIC_STAR T44¦¦**, 
ANY l_orderkey):
 00-07 Filter(condition=[>($1, 58999)]) : rowType = RecordType(DYNAMIC_STAR 
T44¦¦**, ANY l_orderkey):
 00-09   Project(T44¦¦**=[$0], l_orderkey=[$1]) : rowType = 
RecordType(DYNAMIC_STAR T44¦¦**, ANY l_orderkey): 
 00-11 Scan(table=[[cp, tpch/lineitem.parquet]], 
groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=classpath:/tpch/lineitem.parquet]], 
 00-04   *SelectionVectorRemover* : rowType = RecordType(DYNAMIC_STAR T45¦¦**, 
ANY o_custkey, ANY o_orderkey):
 00-06 Filter(condition=[AND(>($1, 1498), >($2, 58999))]) : rowType = 
RecordType(DYNAMIC_STAR T45¦¦**, ANY o_custkey, ANY o_orderkey): 
 00-08   Project(T45¦¦**=[$0], o_custkey=[$1], o_orderkey=[$2]) : rowType = 
RecordType(DYNAMIC_STAR T45¦¦**, ANY o_custkey, ANY o_orderkey):
 00-10 Scan(table=[[cp, tpch/orders.parquet]],
{panel}

  was:
  The Hash-Join and Hash-Aggr operators copy each incoming row separately. When 
the incoming data has a selection vector (e.g., outgoing from a Filter), a 
_SelectionVectorRemover_ is added before the Hash operator, as the latter 
cannot handle the selection vector.  

  Thus every row is needlessly being copied twice!

+Suggestion+: Enhance the Hash operators to handle potential incoming selection 
vectors, thus eliminating  the need for the extra copy. The planner needs to be 
changed not to add that SelectionVectorRemover.

For example:
{code:sql}
select * from cp.`tpch/lineitem.parquet` L,  cp.`tpch/orders.parquet` O where 
O.o_custkey > 1498 and L.l_orderkey > 58999 and O.o_orderkey = L.l_orderkey 
{code}
And the plan:
{panel}
00-00 Screen : rowType = RecordType(DYNAMIC_STAR **, DYNAMIC_STAR **0): 
 00-01 ProjectAllowDup(**=[$0], **0=[$1]) : rowType = RecordType(DYNAMIC_STAR 
**, DYNAMIC_STAR **0): 
 00-02 Project(T44¦¦**=[$0], T45¦¦**=[$2]) : rowType = RecordType(DYNAMIC_STAR 
T44¦¦**, DYNAMIC_STAR T45¦¦**): 
 00-03 HashJoin(condition=[=($1, $4)], joinType=[inner], semi-join: =[false]) : 
rowType = RecordType(DYNAMIC_STAR T44¦¦**, ANY l_orderkey, DYNAMIC_STAR 
T45¦¦**, ANY o_custkey, ANY o_orderkey): 
 00-05 *SelectionVectorRemover* : rowType = RecordType(DYNAMIC_STAR T44¦¦**, 
ANY l_orderkey):
 00-07 Filter(condition=[>($1, 58999)]) : rowType = RecordType(DYNAMIC_STAR 
T44¦¦**, ANY l_orderkey):
 00-09 Project(T44¦¦**=[$0], l_orderkey=[$1]) : rowType = 
RecordType(DYNAMIC_STAR T44¦¦**, ANY l_orderkey): 
 00-11 Scan(table=[[cp, tpch/lineitem.parquet]], groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=classpath:/tpch/lineitem.parquet]], 
 00-04 *SelectionVectorRemover* : rowType = RecordType(DYNAMIC_STAR T45¦¦**, 
ANY o_custkey, ANY o_orderkey):
 00-06 Filter(condition=[AND(>($1, 1498), >($2, 58999))]) : rowType = 
RecordType(DYNAMIC_STAR T45¦¦**, ANY o_custkey, ANY o_orderkey): 
 00-08 Project(T45¦¦**=[$0], o_custkey=[$1], o_orderkey=[$2]) : rowType = 
RecordType(DYNAMIC_STAR T45¦¦**, ANY o_custkey, ANY o_orderkey):
 00-10 Scan(table=[[cp, tpch/orders.parquet]],
{panel}


> Hash-Join and Hash-Aggr to handle incoming with selection vectors
> -
>
> Key: DRILL-7013
> URL: https://issues.apache.org/jira/browse/DRILL-7013
> Project: Apache Drill
>  Issue 

[jira] [Created] (DRILL-7013) Hash-Join and Hash-Aggr to handle incoming with selection vectors

2019-01-28 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-7013:
---

 Summary: Hash-Join and Hash-Aggr to handle incoming with selection 
vectors
 Key: DRILL-7013
 URL: https://issues.apache.org/jira/browse/DRILL-7013
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Relational Operators, Query Planning & 
Optimization
Affects Versions: 1.15.0
Reporter: Boaz Ben-Zvi


  The Hash-Join and Hash-Aggr operators copy each incoming row separately. When 
the incoming data has a selection vector (e.g., outgoing from a Filter), a 
_SelectionVectorRemover_ is added before the Hash operator, as the latter 
cannot handle the selection vector.  

  Thus every row is needlessly being copied twice!

+Suggestion+: Enhance the Hash operators to handle potential incoming selection 
vectors, thus eliminating  the need for the extra copy. The planner needs to be 
changed not to add that SelectionVectorRemover.

For example:
{code:sql}
select * from cp.`tpch/lineitem.parquet` L,  cp.`tpch/orders.parquet` O where 
O.o_custkey > 1498 and L.l_orderkey > 58999 and O.o_orderkey = L.l_orderkey 
{code}
And the plan:
{panel}
00-00 Screen : rowType = RecordType(DYNAMIC_STAR **, DYNAMIC_STAR **0): 
 00-01 ProjectAllowDup(**=[$0], **0=[$1]) : rowType = RecordType(DYNAMIC_STAR 
**, DYNAMIC_STAR **0): 
 00-02 Project(T44¦¦**=[$0], T45¦¦**=[$2]) : rowType = RecordType(DYNAMIC_STAR 
T44¦¦**, DYNAMIC_STAR T45¦¦**): 
 00-03 HashJoin(condition=[=($1, $4)], joinType=[inner], semi-join: =[false]) : 
rowType = RecordType(DYNAMIC_STAR T44¦¦**, ANY l_orderkey, DYNAMIC_STAR 
T45¦¦**, ANY o_custkey, ANY o_orderkey): 
 00-05 *SelectionVectorRemover* : rowType = RecordType(DYNAMIC_STAR T44¦¦**, 
ANY l_orderkey):
 00-07 Filter(condition=[>($1, 58999)]) : rowType = RecordType(DYNAMIC_STAR 
T44¦¦**, ANY l_orderkey):
 00-09 Project(T44¦¦**=[$0], l_orderkey=[$1]) : rowType = 
RecordType(DYNAMIC_STAR T44¦¦**, ANY l_orderkey): 
 00-11 Scan(table=[[cp, tpch/lineitem.parquet]], groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=classpath:/tpch/lineitem.parquet]], 
 00-04 *SelectionVectorRemover* : rowType = RecordType(DYNAMIC_STAR T45¦¦**, 
ANY o_custkey, ANY o_orderkey):
 00-06 Filter(condition=[AND(>($1, 1498), >($2, 58999))]) : rowType = 
RecordType(DYNAMIC_STAR T45¦¦**, ANY o_custkey, ANY o_orderkey): 
 00-08 Project(T45¦¦**=[$0], o_custkey=[$1], o_orderkey=[$2]) : rowType = 
RecordType(DYNAMIC_STAR T45¦¦**, ANY o_custkey, ANY o_orderkey):
 00-10 Scan(table=[[cp, tpch/orders.parquet]],
{panel}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7012) Make SelectionVectorRemover project only the needed columns

2019-01-28 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-7012:

Description: 
   A SelectionVectorRemover is often used after a filter, to copy only the 
rows that passed the filter into a newly allocated batch. In some cases the 
columns used by the filter are not needed downstream; currently these columns 
are needlessly allocated and copied, and later removed by a Project.

  _+Suggested improvement+_: The planner can pass the information about these 
columns to the SelectionVectorRemover, which would avoid this useless 
allocation and copy. The Planner would also eliminate that Project from the 
plan.

   Here is an example, the query:
{code:sql}
select max(l_quantity) from cp.`tpch/lineitem.parquet` L where L.l_orderkey > 
58999 and L.l_shipmode = 'TRUCK' group by l_linenumber ;
{code}
And the result plan (trimmed for readability), where "l_orderkey" and 
"l_shipmode" are removed by the Project:
{panel}
00-00 Screen : rowType = RecordType(ANY EXPR$0): 
 00-01 Project(EXPR$0=[$0]) : rowType = RecordType(ANY EXPR$0): 
 00-02 Project(EXPR$0=[$1]) : rowType = RecordType(ANY EXPR$0): 
 00-03 HashAgg(group=[\{0}], EXPR$0=[MAX($1)]) : rowType = RecordType(ANY 
l_linenumber, ANY EXPR$0): 
 00-04 *Project*(l_linenumber=[$2], l_quantity=[$3]) : rowType = RecordType(ANY 
l_linenumber, ANY l_quantity): 
 00-05 *SelectionVectorRemover* : rowType = RecordType(ANY *l_orderkey*, ANY 
*l_shipmode*, ANY l_linenumber, ANY l_quantity): 
 00-06 *Filter*(condition=[AND(>($0, 58999), =($1, 'TRUCK'))]) : rowType = 
RecordType(ANY l_orderkey, ANY l_shipmode, ANY l_linenumber, ANY l_quantity): 
 00-07 Scan(table=[[cp, tpch/lineitem.parquet]], groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=classpath:/tpch/lineitem.parquet]], 
selectionRoot=classpath:/tpch/lineitem.parquet, numFiles=1, numRowGroups=1, 
usedMetadataFile=false, columns=[`l_orderkey`, `l_shipmode`, `l_linenumber`, 
`l_quantity`]]]) : rowType = RecordType(ANY l_orderkey, ANY l_shipmode, ANY 
l_linenumber, ANY l_quantity):
{panel}
The implementation will not be simple, as the relevant code (e.g., 
GenericSV2Copier) has no idea of specific columns.
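
A sketch of the intended behavior (assumptions: the ValueVector.copyEntry() 
and SelectionVector2 APIs; 'needed' is a hypothetical planner-provided set of 
column indexes):
{code:java}
import java.util.BitSet;
import java.util.List;

import org.apache.drill.exec.record.selection.SelectionVector2;
import org.apache.drill.exec.vector.ValueVector;

public class ColumnAwareCopierSketch {
  // Copy only the rows selected by the SV2, and only for the columns that are
  // actually used downstream; filter-only columns are skipped entirely.
  static void copyNeeded(SelectionVector2 sv2, List<ValueVector> in,
                         List<ValueVector> out, BitSet needed) {
    for (int c = 0; c < in.size(); c++) {
      if (!needed.get(c)) {
        continue;                    // skip a column the plan drops later
      }
      ValueVector dst = out.get(c);
      for (int i = 0; i < sv2.getCount(); i++) {
        dst.copyEntry(i, in.get(c), sv2.getIndex(i));
      }
    }
  }
}
{code}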

  was:
   A SelectionVectorRemover is often used after a filter, to copy into a newly 
allocated new batch only the "filtered out" rows. In some cases the columns 
used by the filter are not needed downstream; currently these columns are being 
needlessly allocated and copied, and later removed by a Project.

  _+Suggested improvement+_: The planner can pass the information about these 
columns to the SelectionVectorRemover, which would avoid this useless 
allocation and copy. The Planner would also eliminate that Project from the 
plan.

   Here is an example, the query:
{code:sql}
select max(l_quantity) from cp.`tpch/lineitem.parquet` L where L.l_orderkey > 
58999 and L.l_shipmode = 'TRUCK' group by l_linenumber ;
{code}
And the result plan (trimmed for readability), where "l_orderkey" and 
"l_shipmode" are removed by the Project:
{noformat}
00-00 Screen : rowType = RecordType(ANY EXPR$0): 
 00-01 Project(EXPR$0=[$0]) : rowType = RecordType(ANY EXPR$0): 
 00-02 Project(EXPR$0=[$1]) : rowType = RecordType(ANY EXPR$0): 
 00-03 HashAgg(group=[\{0}], EXPR$0=[MAX($1)]) : rowType = RecordType(ANY 
l_linenumber, ANY EXPR$0): 
 00-04 *Project*(l_linenumber=[$2], l_quantity=[$3]) : rowType = RecordType(ANY 
l_linenumber, ANY l_quantity): 
 00-05 *SelectionVectorRemover* : rowType = RecordType(ANY *l_orderkey*, ANY 
*l_shipmode*, ANY l_linenumber, ANY l_quantity): 
 00-06 *Filter*(condition=[AND(>($0, 58999), =($1, 'TRUCK'))]) : rowType = 
RecordType(ANY l_orderkey, ANY l_shipmode, ANY l_linenumber, ANY l_quantity): 
 00-07 Scan(table=[[cp, tpch/lineitem.parquet]], groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=classpath:/tpch/lineitem.parquet]], 
selectionRoot=classpath:/tpch/lineitem.parquet, numFiles=1, numRowGroups=1, 
usedMetadataFile=false, columns=[`l_orderkey`, `l_shipmode`, `l_linenumber`, 
`l_quantity`]]]) : rowType = RecordType(ANY l_orderkey, ANY l_shipmode, ANY 
l_linenumber, ANY l_quantity):
{noformat}
The implementation will not be simple, as the relevant code (e.g., 
GenericSV2Copier) has no idea of specific columns.


> Make SelectionVectorRemover project only the needed columns
> ---
>
> Key: DRILL-7012
> URL: https://issues.apache.org/jira/browse/DRILL-7012
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators, Query Planning & 
> Optimization
>Affects Versions: 1.15.0
>Reporter: Boaz Ben-Zvi
>Priority: Minor
>
>    A SelectionVectorRemover is often used after a filter, to copy into a 
> newly allocated new batch only the "filtered out" rows. In some cases the 

[jira] [Updated] (DRILL-7012) Make SelectionVectorRemover project only the needed columns

2019-01-28 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-7012:

Description: 
   A SelectionVectorRemover is often used after a filter, to copy only the 
rows that passed the filter into a newly allocated batch. In some cases the 
columns used by the filter are not needed downstream; currently these columns 
are needlessly allocated and copied, and later removed by a Project.

  _+Suggested improvement+_: The planner can pass the information about these 
columns to the SelectionVectorRemover, which would avoid this useless 
allocation and copy. The Planner would also eliminate that Project from the 
plan.

   Here is an example, the query:
{code:sql}
select max(l_quantity) from cp.`tpch/lineitem.parquet` L where L.l_orderkey > 
58999 and L.l_shipmode = 'TRUCK' group by l_linenumber ;
{code}
And the result plan (trimmed for readability), where "l_orderkey" and 
"l_shipmode" are removed by the Project:
{noformat}
00-00 Screen : rowType = RecordType(ANY EXPR$0): 
 00-01 Project(EXPR$0=[$0]) : rowType = RecordType(ANY EXPR$0): 
 00-02 Project(EXPR$0=[$1]) : rowType = RecordType(ANY EXPR$0): 
 00-03 HashAgg(group=[\{0}], EXPR$0=[MAX($1)]) : rowType = RecordType(ANY 
l_linenumber, ANY EXPR$0): 
 00-04 *Project*(l_linenumber=[$2], l_quantity=[$3]) : rowType = RecordType(ANY 
l_linenumber, ANY l_quantity): 
 00-05 *SelectionVectorRemover* : rowType = RecordType(ANY *l_orderkey*, ANY 
*l_shipmode*, ANY l_linenumber, ANY l_quantity): 
 00-06 *Filter*(condition=[AND(>($0, 58999), =($1, 'TRUCK'))]) : rowType = 
RecordType(ANY l_orderkey, ANY l_shipmode, ANY l_linenumber, ANY l_quantity): 
 00-07 Scan(table=[[cp, tpch/lineitem.parquet]], groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=classpath:/tpch/lineitem.parquet]], 
selectionRoot=classpath:/tpch/lineitem.parquet, numFiles=1, numRowGroups=1, 
usedMetadataFile=false, columns=[`l_orderkey`, `l_shipmode`, `l_linenumber`, 
`l_quantity`]]]) : rowType = RecordType(ANY l_orderkey, ANY l_shipmode, ANY 
l_linenumber, ANY l_quantity):
{noformat}
The implementation will not be simple, as the relevant code (e.g., 
GenericSV2Copier) has no idea of specific columns.

  was:
   A SelectionVectorRemover is often used after a filter, to copy into a newly 
allocated new batch only the "filtered out" rows. In some cases the columns 
used by the filter are not needed downstream; currently these columns are being 
needlessly allocated and copied, and later removed by a Project.

  _+Suggested improvement+_: The planner can pass the information about these 
columns to the SelectionVectorRemover, which would avoid this useless 
allocation and copy. The Planner would also eliminate that Project from the 
plan.

   Here is an example, the query:
{code:java}
select max(l_quantity) from cp.`tpch/lineitem.parquet` L where L.l_orderkey > 
58999 and L.l_shipmode = 'TRUCK' group by l_linenumber ;
{code}
And the result plan (trimmed for readability), where "l_orderkey" and 
"l_shipmode" are removed by the Project:
{noformat}
00-00 Screen : rowType = RecordType(ANY EXPR$0): 
 00-01 Project(EXPR$0=[$0]) : rowType = RecordType(ANY EXPR$0): 
 00-02 Project(EXPR$0=[$1]) : rowType = RecordType(ANY EXPR$0): 
 00-03 HashAgg(group=[\{0}], EXPR$0=[MAX($1)]) : rowType = RecordType(ANY 
l_linenumber, ANY EXPR$0): 
 00-04 *Project*(l_linenumber=[$2], l_quantity=[$3]) : rowType = RecordType(ANY 
l_linenumber, ANY l_quantity): 
 00-05 *SelectionVectorRemover* : rowType = RecordType(ANY *l_orderkey*, ANY 
*l_shipmode*, ANY l_linenumber, ANY l_quantity): 
 00-06 *Filter*(condition=[AND(>($0, 58999), =($1, 'TRUCK'))]) : rowType = 
RecordType(ANY l_orderkey, ANY l_shipmode, ANY l_linenumber, ANY l_quantity): 
 00-07 Scan(table=[[cp, tpch/lineitem.parquet]], groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=classpath:/tpch/lineitem.parquet]], 
selectionRoot=classpath:/tpch/lineitem.parquet, numFiles=1, numRowGroups=1, 
usedMetadataFile=false, columns=[`l_orderkey`, `l_shipmode`, `l_linenumber`, 
`l_quantity`]]]) : rowType = RecordType(ANY l_orderkey, ANY l_shipmode, ANY 
l_linenumber, ANY l_quantity):
{noformat}
The implementation will not be simple, as the relevant code (e.g., 
GenericSV2Copier) has no idea of specific columns.


> Make SelectionVectorRemover project only the needed columns
> ---
>
> Key: DRILL-7012
> URL: https://issues.apache.org/jira/browse/DRILL-7012
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators, Query Planning & 
> Optimization
>Affects Versions: 1.15.0
>Reporter: Boaz Ben-Zvi
>Priority: Minor
>
>    A SelectionVectorRemover is often used after a filter, to copy into a 
> newly allocated new batch only the "filtered out" rows. In some 

[jira] [Created] (DRILL-7012) Make SelectionVectorRemover project only the needed columns

2019-01-28 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-7012:
---

 Summary: Make SelectionVectorRemover project only the needed 
columns
 Key: DRILL-7012
 URL: https://issues.apache.org/jira/browse/DRILL-7012
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Relational Operators, Query Planning & 
Optimization
Affects Versions: 1.15.0
Reporter: Boaz Ben-Zvi


   A SelectionVectorRemover is often used after a filter, to copy only the 
rows that passed the filter into a newly allocated batch. In some cases the 
columns used by the filter are not needed downstream; currently these columns 
are needlessly allocated and copied, and later removed by a Project.

  _+Suggested improvement+_: The planner can pass the information about these 
columns to the SelectionVectorRemover, which would avoid this useless 
allocation and copy. The Planner would also eliminate that Project from the 
plan.

   Here is an example, the query:
{code:java}
select max(l_quantity) from cp.`tpch/lineitem.parquet` L where L.l_orderkey > 
58999 and L.l_shipmode = 'TRUCK' group by l_linenumber ;
{code}
And the result plan (trimmed for readability), where "l_orderkey" and 
"l_shipmode" are removed by the Project:
{noformat}
00-00 Screen : rowType = RecordType(ANY EXPR$0): 
 00-01 Project(EXPR$0=[$0]) : rowType = RecordType(ANY EXPR$0): 
 00-02 Project(EXPR$0=[$1]) : rowType = RecordType(ANY EXPR$0): 
 00-03 HashAgg(group=[\{0}], EXPR$0=[MAX($1)]) : rowType = RecordType(ANY 
l_linenumber, ANY EXPR$0): 
 00-04 *Project*(l_linenumber=[$2], l_quantity=[$3]) : rowType = RecordType(ANY 
l_linenumber, ANY l_quantity): 
 00-05 *SelectionVectorRemover* : rowType = RecordType(ANY *l_orderkey*, ANY 
*l_shipmode*, ANY l_linenumber, ANY l_quantity): 
 00-06 *Filter*(condition=[AND(>($0, 58999), =($1, 'TRUCK'))]) : rowType = 
RecordType(ANY l_orderkey, ANY l_shipmode, ANY l_linenumber, ANY l_quantity): 
 00-07 Scan(table=[[cp, tpch/lineitem.parquet]], groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=classpath:/tpch/lineitem.parquet]], 
selectionRoot=classpath:/tpch/lineitem.parquet, numFiles=1, numRowGroups=1, 
usedMetadataFile=false, columns=[`l_orderkey`, `l_shipmode`, `l_linenumber`, 
`l_quantity`]]]) : rowType = RecordType(ANY l_orderkey, ANY l_shipmode, ANY 
l_linenumber, ANY l_quantity):
{noformat}
The implementation will not be simple, as the relevant code (e.g., 
GenericSV2Copier) has no idea of specific columns.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6914) Query with RuntimeFilter and SemiJoin fails with IllegalStateException: Memory was leaked by query

2019-01-09 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738957#comment-16738957
 ] 

Boaz Ben-Zvi commented on DRILL-6914:
-

This memory leak can be reproduced on SF1 (on the Mac) by forcing the Hash-Join 
to *spill*, e.g., by setting this _internal_ option ("spill if the number of 
batches in memory reaches 1000"):
{code:java}
alter system set `exec.hashjoin.max_batches_in_memory` = 1000;
{code}
(Also removed the irrelevant 'distinct' and the 'cast' parts from the repro 
query).


 However, the leak could not be reproduced by spilling on SF0, or by spilling 
with a regular (non-semi) Hash-Join.
 Also tried the fix from PR#1600 (DRILL-6947), but it did not cure this memory 
leak. 

Looking at the Semi-Join code changes (PR#1522), none seems to conflict with 
the runtime filter; maybe [~weijie] has some ideas about the cause of this 
leak.

 

 

> Query with RuntimeFilter and SemiJoin fails with IllegalStateException: 
> Memory was leaked by query
> --
>
> Key: DRILL-6914
> URL: https://issues.apache.org/jira/browse/DRILL-6914
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.15.0
>Reporter: Abhishek Ravi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.16.0
>
> Attachments: 23cc1af3-0e8e-b2c9-a889-a96504988d6c.sys.drill, 
> 23cc1b7c-5b5c-d123-5e72-6d7d2719df39.sys.drill
>
>
> Following query fails on TPC-H SF 100 dataset when 
> exec.hashjoin.enable.runtime_filter = true AND planner.enable_semijoin = true.
> Note that the query does not fail if any one of them or both are disabled.
> {code:sql}
> set `exec.hashjoin.enable.runtime_filter` = true;
> set `exec.hashjoin.runtime_filter.max.waiting.time` = 1;
> set `planner.enable_broadcast_join` = false;
> set `planner.enable_semijoin` = true;
> select
>  count(*) as row_count
> from
>  lineitem l1
> where
>  l1.l_shipdate IN (
>  select
>  distinct(cast(l2.l_shipdate as date))
>  from
>  lineitem l2);
> reset `exec.hashjoin.enable.runtime_filter`;
> reset `exec.hashjoin.runtime_filter.max.waiting.time`;
> reset `planner.enable_broadcast_join`;
> reset `planner.enable_semijoin`;
> {code}
>  
> {noformat}
> Error: SYSTEM ERROR: IllegalStateException: Memory was leaked by query. 
> Memory leaked: (134217728)
> Allocator(frag:1:0) 800/134217728/172453568/70126322567 
> (res/actual/peak/limit)
> Fragment 1:0
> Please, refer to logs for more information.
> [Error Id: ccee18b3-c3ff-4fdb-b314-23a6cfed0a0e on qa-node185.qa.lab:31010] 
> (state=,code=0)
> java.sql.SQLException: SYSTEM ERROR: IllegalStateException: Memory was leaked 
> by query. Memory leaked: (134217728)
> Allocator(frag:1:0) 800/134217728/172453568/70126322567 
> (res/actual/peak/limit)
> Fragment 1:0
> Please, refer to logs for more information.
> [Error Id: ccee18b3-c3ff-4fdb-b314-23a6cfed0a0e on qa-node185.qa.lab:31010]
> at 
> org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(DrillCursor.java:536)
> at org.apache.drill.jdbc.impl.DrillCursor.next(DrillCursor.java:640)
> at org.apache.calcite.avatica.AvaticaResultSet.next(AvaticaResultSet.java:217)
> at 
> org.apache.drill.jdbc.impl.DrillResultSetImpl.next(DrillResultSetImpl.java:151)
> at sqlline.BufferedRows.<init>(BufferedRows.java:37)
> at sqlline.SqlLine.print(SqlLine.java:1716)
> at sqlline.Commands.execute(Commands.java:949)
> at sqlline.Commands.sql(Commands.java:882)
> at sqlline.SqlLine.dispatch(SqlLine.java:725)
> at sqlline.SqlLine.runCommands(SqlLine.java:1779)
> at sqlline.Commands.run(Commands.java:1485)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:38)
> at sqlline.SqlLine.dispatch(SqlLine.java:722)
> at sqlline.SqlLine.initArgs(SqlLine.java:458)
> at sqlline.SqlLine.begin(SqlLine.java:514)
> at sqlline.SqlLine.start(SqlLine.java:264)
> at sqlline.SqlLine.main(SqlLine.java:195)
> Caused by: org.apache.drill.common.exceptions.UserRemoteException: SYSTEM 
> ERROR: IllegalStateException: Memory was leaked by query. Memory leaked: 
> (134217728)
> Allocator(frag:1:0) 800/134217728/172453568/70126322567 
> (res/actual/peak/limit)
> Fragment 1:0
> Please, refer to logs for more information.
> [Error Id: ccee18b3-c3ff-4fdb-b314-23a6cfed0a0e on qa-node185.qa.lab:31010]
> at 
> org.apache.drill.exec.rpc.user.QueryResultHandler.resultArrived(QueryResultHandler.java:123)
> at 

[jira] [Commented] (DRILL-6914) Query with RuntimeFilter and SemiJoin fails with IllegalStateException: Memory was leaked by query

2019-01-03 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733484#comment-16733484
 ] 

Boaz Ben-Zvi commented on DRILL-6914:
-

The `enable_semijoin` option is true by default, so setting it to TRUE does not 
change anything; it would be useful to see if setting it to FALSE makes any 
difference.
The `runtime_filter` option is part of the "Bloom filter" feature, which AFAIK 
still has some issues, and hence is not enabled by default.
And [~aravi5] - please attach the profile of the failed query, so we can see 
the physical plan used, etc.
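
For reference, a minimal way to run that retest (session scope; assuming 
defaults otherwise):
{code:sql}
-- leave the runtime filter on, turn only the semi-join off
alter session set `planner.enable_semijoin` = false;
-- rerun the failing query here, then restore the default:
reset `planner.enable_semijoin`;
{code}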


> Query with RuntimeFilter and SemiJoin fails with IllegalStateException: 
> Memory was leaked by query
> --
>
> Key: DRILL-6914
> URL: https://issues.apache.org/jira/browse/DRILL-6914
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.15.0
>Reporter: Abhishek Ravi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.16.0
>
>
> The following query fails on the TPC-H SF 100 dataset when 
> exec.hashjoin.enable.runtime_filter = true AND planner.enable_semijoin = true.
> Note that the query does not fail if either one or both are disabled.
> {code:sql}
> set `exec.hashjoin.enable.runtime_filter` = true;
> set `exec.hashjoin.runtime_filter.max.waiting.time` = 1;
> set `planner.enable_broadcast_join` = false;
> set `planner.enable_semijoin` = true;
> select
>  count(*) as row_count
> from
>  lineitem l1
> where
>  l1.l_shipdate IN (
>  select
>  distinct(cast(l2.l_shipdate as date))
>  from
>  lineitem l2);
> reset `exec.hashjoin.enable.runtime_filter`;
> reset `exec.hashjoin.runtime_filter.max.waiting.time`;
> reset `planner.enable_broadcast_join`;
> reset `planner.enable_semijoin`;
> {code}
>  
> {noformat}
> Error: SYSTEM ERROR: IllegalStateException: Memory was leaked by query. 
> Memory leaked: (134217728)
> Allocator(frag:1:0) 800/134217728/172453568/70126322567 
> (res/actual/peak/limit)
> Fragment 1:0
> Please, refer to logs for more information.
> [Error Id: ccee18b3-c3ff-4fdb-b314-23a6cfed0a0e on qa-node185.qa.lab:31010] 
> (state=,code=0)
> java.sql.SQLException: SYSTEM ERROR: IllegalStateException: Memory was leaked 
> by query. Memory leaked: (134217728)
> Allocator(frag:1:0) 800/134217728/172453568/70126322567 
> (res/actual/peak/limit)
> Fragment 1:0
> Please, refer to logs for more information.
> [Error Id: ccee18b3-c3ff-4fdb-b314-23a6cfed0a0e on qa-node185.qa.lab:31010]
> at 
> org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(DrillCursor.java:536)
> at org.apache.drill.jdbc.impl.DrillCursor.next(DrillCursor.java:640)
> at org.apache.calcite.avatica.AvaticaResultSet.next(AvaticaResultSet.java:217)
> at 
> org.apache.drill.jdbc.impl.DrillResultSetImpl.next(DrillResultSetImpl.java:151)
> at sqlline.BufferedRows.<init>(BufferedRows.java:37)
> at sqlline.SqlLine.print(SqlLine.java:1716)
> at sqlline.Commands.execute(Commands.java:949)
> at sqlline.Commands.sql(Commands.java:882)
> at sqlline.SqlLine.dispatch(SqlLine.java:725)
> at sqlline.SqlLine.runCommands(SqlLine.java:1779)
> at sqlline.Commands.run(Commands.java:1485)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:38)
> at sqlline.SqlLine.dispatch(SqlLine.java:722)
> at sqlline.SqlLine.initArgs(SqlLine.java:458)
> at sqlline.SqlLine.begin(SqlLine.java:514)
> at sqlline.SqlLine.start(SqlLine.java:264)
> at sqlline.SqlLine.main(SqlLine.java:195)
> Caused by: org.apache.drill.common.exceptions.UserRemoteException: SYSTEM 
> ERROR: IllegalStateException: Memory was leaked by query. Memory leaked: 
> (134217728)
> Allocator(frag:1:0) 800/134217728/172453568/70126322567 
> (res/actual/peak/limit)
> Fragment 1:0
> Please, refer to logs for more information.
> [Error Id: ccee18b3-c3ff-4fdb-b314-23a6cfed0a0e on qa-node185.qa.lab:31010]
> at 
> org.apache.drill.exec.rpc.user.QueryResultHandler.resultArrived(QueryResultHandler.java:123)
> at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:422)
> at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:96)
> at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:273)
> at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:243)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
> at 
> 

[jira] [Commented] (DRILL-6938) SQL get the wrong result after hashjoin and hashagg disabled

2019-01-02 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732555#comment-16732555
 ] 

Boaz Ben-Zvi commented on DRILL-6938:
-

Maybe this is something specific to 1.13.0; [~dony.dong], can you provide a 
repro `emp` table?

I was testing with 1.15.0 and could not reproduce.

 

> SQL get the wrong result after hashjoin and hashagg disabled
> 
>
> Key: DRILL-6938
> URL: https://issues.apache.org/jira/browse/DRILL-6938
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Dony Dong
>Assignee: Boaz Ben-Zvi
>Priority: Critical
>
> Hi Team
> After we disabled hashjoin and hashagg to fix an out-of-memory issue, we got 
> the wrong result.
> With these two options enabled, we get 8 rows. After we disable them, it 
> returns only 3 rows. It seems some MEM_ID values were excluded before the 
> group-by or some other step.
> select b.MEM_ID,count(distinct b.DEP_NO)
> from dfs.test.emp b
> where b.DEP_NO<>'-'
> and b.MEM_ID in ('68','412','852','117','657','816','135','751')
> and b.HIRE_DATE>'2014-06-01'
> group by b.MEM_ID
> order by 1;



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6914) Query with RuntimeFilter and SemiJoin fails with IllegalStateException: Memory was leaked by query

2019-01-02 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732543#comment-16732543
 ] 

Boaz Ben-Zvi commented on DRILL-6914:
-

The error does not show up on SF1.

`planner.enable_semijoin` is true by default. [~aravi5] - can you retest with 
this option set to FALSE?

Can the profile of the failed query be attached to this Jira?

 

> Query with RuntimeFilter and SemiJoin fails with IllegalStateException: 
> Memory was leaked by query
> --
>
> Key: DRILL-6914
> URL: https://issues.apache.org/jira/browse/DRILL-6914
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.15.0
>Reporter: Abhishek Ravi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.16.0
>
>
> The following query fails on the TPC-H SF 100 dataset when 
> exec.hashjoin.enable.runtime_filter = true AND planner.enable_semijoin = true.
> Note that the query does not fail if either one or both are disabled.
> {code:sql}
> set `exec.hashjoin.enable.runtime_filter` = true;
> set `exec.hashjoin.runtime_filter.max.waiting.time` = 1;
> set `planner.enable_broadcast_join` = false;
> set `planner.enable_semijoin` = true;
> select
>  count(*) as row_count
> from
>  lineitem l1
> where
>  l1.l_shipdate IN (
>  select
>  distinct(cast(l2.l_shipdate as date))
>  from
>  lineitem l2);
> reset `exec.hashjoin.enable.runtime_filter`;
> reset `exec.hashjoin.runtime_filter.max.waiting.time`;
> reset `planner.enable_broadcast_join`;
> reset `planner.enable_semijoin`;
> {code}
>  
> {noformat}
> Error: SYSTEM ERROR: IllegalStateException: Memory was leaked by query. 
> Memory leaked: (134217728)
> Allocator(frag:1:0) 800/134217728/172453568/70126322567 
> (res/actual/peak/limit)
> Fragment 1:0
> Please, refer to logs for more information.
> [Error Id: ccee18b3-c3ff-4fdb-b314-23a6cfed0a0e on qa-node185.qa.lab:31010] 
> (state=,code=0)
> java.sql.SQLException: SYSTEM ERROR: IllegalStateException: Memory was leaked 
> by query. Memory leaked: (134217728)
> Allocator(frag:1:0) 800/134217728/172453568/70126322567 
> (res/actual/peak/limit)
> Fragment 1:0
> Please, refer to logs for more information.
> [Error Id: ccee18b3-c3ff-4fdb-b314-23a6cfed0a0e on qa-node185.qa.lab:31010]
> at 
> org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(DrillCursor.java:536)
> at org.apache.drill.jdbc.impl.DrillCursor.next(DrillCursor.java:640)
> at org.apache.calcite.avatica.AvaticaResultSet.next(AvaticaResultSet.java:217)
> at 
> org.apache.drill.jdbc.impl.DrillResultSetImpl.next(DrillResultSetImpl.java:151)
> at sqlline.BufferedRows.<init>(BufferedRows.java:37)
> at sqlline.SqlLine.print(SqlLine.java:1716)
> at sqlline.Commands.execute(Commands.java:949)
> at sqlline.Commands.sql(Commands.java:882)
> at sqlline.SqlLine.dispatch(SqlLine.java:725)
> at sqlline.SqlLine.runCommands(SqlLine.java:1779)
> at sqlline.Commands.run(Commands.java:1485)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:38)
> at sqlline.SqlLine.dispatch(SqlLine.java:722)
> at sqlline.SqlLine.initArgs(SqlLine.java:458)
> at sqlline.SqlLine.begin(SqlLine.java:514)
> at sqlline.SqlLine.start(SqlLine.java:264)
> at sqlline.SqlLine.main(SqlLine.java:195)
> Caused by: org.apache.drill.common.exceptions.UserRemoteException: SYSTEM 
> ERROR: IllegalStateException: Memory was leaked by query. Memory leaked: 
> (134217728)
> Allocator(frag:1:0) 800/134217728/172453568/70126322567 
> (res/actual/peak/limit)
> Fragment 1:0
> Please, refer to logs for more information.
> [Error Id: ccee18b3-c3ff-4fdb-b314-23a6cfed0a0e on qa-node185.qa.lab:31010]
> at 
> org.apache.drill.exec.rpc.user.QueryResultHandler.resultArrived(QueryResultHandler.java:123)
> at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:422)
> at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:96)
> at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:273)
> at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:243)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
> at 
> 

[jira] [Commented] (DRILL-6938) SQL get the wrong result after hashjoin and hashagg disabled

2018-12-31 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731501#comment-16731501
 ] 

Boaz Ben-Zvi commented on DRILL-6938:
-

I tried to reproduce this, but could not see the failure. I created a Parquet 
table with 500k rows, containing the above values (plus a few other values), 
mixed at random. HIRE_DATE was of type DATE; the other two columns were VARCHAR.

Also forced the IN clause to produce a join:
{code}
alter session set `planner.in_subquery_threshold` = 2;
{code}

But this did not make a difference. The join was indeed implemented with a 
merge-join, which does not yet support Semi-Join functionality; however, the 
query did return the expected result (even after adding duplicates to the 
in-list).


> SQL get the wrong result after hashjoin and hashagg disabled
> 
>
> Key: DRILL-6938
> URL: https://issues.apache.org/jira/browse/DRILL-6938
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Dony Dong
>Assignee: Boaz Ben-Zvi
>Priority: Critical
>
> Hi Team
> After we disabled hashjoin and hashagg to fix an out-of-memory issue, we got 
> the wrong result.
> With these two options enabled, we get 8 rows. After we disable them, it 
> returns only 3 rows. It seems some MEM_ID values were excluded before the 
> group-by or some other step.
> select b.MEM_ID,count(distinct b.DEP_NO)
> from dfs.test.emp b
> where b.DEP_NO<>'-'
> and b.MEM_ID in ('68','412','852','117','657','816','135','751')
> and b.HIRE_DATE>'2014-06-01'
> group by b.MEM_ID
> order by 1;



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6880) Hash-Join: Many null keys on the build side form a long linked chain in the Hash Table

2018-12-28 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6880:

Description: 
When building the Hash Table for the Hash-Join, each new key is matched with an 
existing key (same bucket) by calling the generated method 
`isKeyMatchInternalBuild`, which compares the two. However, when both keys are 
null, the method returns *false* (meaning not-equal, i.e., a new key), so the 
new key is added to the list after the old key. When a third null key is found, 
it is matched against the prior two and added as well, and so on.

This way, N null values perform on the order of N^2/2 comparisons.

_Suggested improvement_: The generated code should return a third result, 
meaning "two null keys". Then, in the case of Inner or Left joins, all the 
duplicate nulls can be discarded.
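
A minimal sketch of that idea (hypothetical names; the real change would be in 
the generated code):
{code:java}
// A sketch only; not the actual Drill codegen.
public class NullAwareKeyMatch {
  enum KeyMatch {
    MATCH,      // keys equal: existing entry found
    NO_MATCH,   // keys differ: keep probing the bucket's chain
    BOTH_NULL   // both null: for Inner/Left joins, drop the duplicate null
  }

  // Compares one (possibly null) build-side key against an existing entry.
  static KeyMatch compareKeys(Integer newKey, Integer existingKey) {
    if (newKey == null && existingKey == null) {
      return KeyMatch.BOTH_NULL;  // today this case returns false (NO_MATCH),
                                  // which chains every null after every other null
    }
    if (newKey == null || existingKey == null) {
      return KeyMatch.NO_MATCH;
    }
    return newKey.equals(existingKey) ? KeyMatch.MATCH : KeyMatch.NO_MATCH;
  }
}
{code}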

Below is a simple example; note the time difference between the non-null and 
the all-nulls tables (instrumentation also showed that for nulls, the method 
above was called 1,249,975,000 times!):
{code:java}
0: jdbc:drill:zk=local> use dfs.tmp;
0: jdbc:drill:zk=local> create table testNull as (select cast(null as int) 
mycol from 
 dfs.`/data/test128M.tbl` limit 5);
0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1 
from 
 dfs.`/data/test128M.tbl` limit 6);
0: jdbc:drill:zk=local> create table test2 as (select cast(2 as int) mycol2 
from dfs.`/data/test128M.tbl` limit 5);
0: jdbc:drill:zk=local> select count(*) from test1 join test2 on test1.mycol1 = 
test2.mycol2;
+-+
| EXPR$0  |
+-+
| 0   |
+-+
1 row selected (0.443 seconds)
0: jdbc:drill:zk=local> select count(*) from test1 join testNull on 
test1.mycol1 = testNull.mycol;
+-+
| EXPR$0  |
+-+
| 0   |
+-+
1 row selected (140.098 seconds)
{code}

  was:
When building the Hash Table for the Hash-Join, each new key is matched with an 
existing key (same bucket) by calling the generated method 
`isKeyMatchInternalBuild`, which compares the two. However, when both keys are 
null, the method returns *false* (meaning not-equal, i.e., a new key), so the 
new key is added to the list after the old key. When a third null key is found, 
it is matched against the prior two and added as well, and so on.

This way, N null values perform on the order of N^2/2 comparisons.

Suggested improvement: The generated code should return a third result, meaning 
"two null keys". Then, in the case of Inner or Left joins, all the duplicate 
nulls can be discarded.

Below is a simple example; note the time difference between the non-null and 
the all-nulls tables (instrumentation also showed that for nulls, the method 
above was called 1,249,975,000 times!):
{code:java}
0: jdbc:drill:zk=local> use dfs.tmp;
0: jdbc:drill:zk=local> create table test as (select cast(null as int) mycol 
from 
 dfs.`/data/test128M.tbl` limit 5);
0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1 
from 
 dfs.`/data/test128M.tbl` limit 6);
0: jdbc:drill:zk=local> create table test2 as (select cast(2 as int) mycol2 
from dfs.`/data/test128M.tbl` limit 5);
0: jdbc:drill:zk=local> select count(*) from test1 join test2 on test1.mycol1 = 
test2.mycol2;
+-+
| EXPR$0  |
+-+
| 0   |
+-+
1 row selected (0.443 seconds)
0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1 
from dfs.`/data/test128M.tbl` limit 6);
+---++
| Fragment  | Number of records written  |
+---++
| 0_0   | 6  |
+---++
1 row selected (0.517 seconds)
0: jdbc:drill:zk=local> select count(*) from test1 join test on test1.mycol1 = 
test.mycol;
+-+
| EXPR$0  |
+-+
| 0   |
+-+
1 row selected (140.098 seconds)
{code}


> Hash-Join: Many null keys on the build side form a long linked chain in the 
> Hash Table
> --
>
> Key: DRILL-6880
> URL: https://issues.apache.org/jira/browse/DRILL-6880
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Critical
> Fix For: 1.16.0
>
>
> When building the Hash Table for the Hash-Join, each new key is matched with 
> an existing key (same bucket) by calling the generated method 
> `isKeyMatchInternalBuild`, which compares the two. However when both keys are 
> null, the method returns *false* (meaning not-equal; i.e. it is a new key), 
> thus the new key is added into the list following the old key. When a third 
> null key is found, it would be 

[jira] [Commented] (DRILL-6825) Applying different hash function according to data types and data size

2018-12-20 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726299#comment-16726299
 ] 

Boaz Ben-Zvi commented on DRILL-6825:
-

[~weijie] - When iterating, each data type vector may use a different hash 
function. Indeed, for variable-sized types (usually VARCHAR) a given hash 
function may not perform best when the value is long; however, as these values 
are used as (join/aggr) keys, they are typically of a reasonable size (e.g., 
<= 16). If some user insists on using long keys, they deserve poor performance :) 

We could also have a collection of hash functions, and use some configuration 
map to assign each type its hash function.

The suggestion to extract all the key columns into a temporary buffer and then 
apply a single function over that buffer also has costs, such as the copy and 
the inflexibility of using the same hash function for all types.

Here is an example of a type-specific hash function: for TIMESTAMP, take the 
YYYYMMDD part and XOR it with the seed, then perform a (slower) hash on each 
byte of the microseconds part (the latter usually has more entropy).
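
A rough sketch of that timestamp hash (hypothetical; assumes a 
millis-since-epoch input):
{code:java}
public class TimestampHash {
  // XOR the cheap, low-entropy date part with the seed, then spend the slower
  // per-byte mixing only on the time-of-day part, where most of the entropy lives.
  static long hash(long epochMillis, long seed) {
    long datePart = epochMillis / 86_400_000L;  // whole days: low entropy
    long timePart = epochMillis % 86_400_000L;  // sub-day remainder: high entropy
    long h = seed ^ datePart;                   // near-zero cost
    for (int i = 0; i < 4; i++) {               // 4 bytes cover < 2^27 ms per day
      h = 31 * h + ((timePart >>> (8 * i)) & 0xFF);
    }
    return h;
  }
}
{code}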

> Applying different hash function according to data types and data size
> --
>
> Key: DRILL-6825
> URL: https://issues.apache.org/jira/browse/DRILL-6825
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Codegen
>Reporter: weijie.tong
>Assignee: weijie.tong
>Priority: Major
> Fix For: 1.16.0
>
>
> Different hash functions have different performance according to different 
> data types and data size. We should choose a right one to apply not just 
> Murmurhash.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6915) Unit test mysql-test-data.sql in contrib/jdbc-storage-plugin fails on newer MacOS

2018-12-19 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-6915:
---

 Summary: Unit test mysql-test-data.sql in 
contrib/jdbc-storage-plugin fails on newer MacOS
 Key: DRILL-6915
 URL: https://issues.apache.org/jira/browse/DRILL-6915
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - JDBC
Affects Versions: 1.14.0
 Environment: MacOS, either High Sierra (10.13) or Mojave (10.14).

 
Reporter: Boaz Ben-Zvi


The newer MacOS file systems (10.13 and above) are case-insensitive by default. 
This leads to the following unit test failure:
{code:java}
~/drill > mvn clean install -rf :drill-jdbc-storage
[INFO] Scanning for projects...
[INFO] 
[INFO] Detecting the operating system and CPU architecture
[INFO] 
[INFO] os.detected.name: osx
[INFO] os.detected.arch: x86_64
[INFO] os.detected.version: 10.14
.
[INFO] 
[INFO] Building contrib/jdbc-storage-plugin 1.15.0-SNAPSHOT
[INFO] 
.
[INFO] >> 2018-12-19 15:11:32 7136 [Warning] Setting lower_case_table_names=2 
because file system for __drill/contrib/storage-jdbc/target/mysql-data/data/ is 
case insensitive
.
[ERROR] Failed to execute:
create table CASESENSITIVETABLE (
a BLOB,
b BLOB
)
[INFO] 
[INFO] Reactor Summary:
[INFO]
[INFO] contrib/jdbc-storage-plugin  FAILURE [01:30 min]
...
[ERROR] Failed to execute goal org.codehaus.mojo:sql-maven-plugin:1.5:execute 
(create-tables) on project drill-jdbc-storage: Table 'casesensitivetable' 
already exists -> [Help 1]{code}
The failure occurs in the test file *mysql-test-data.sql*, where +both+ tables 
*caseSensitiveTable* and *CASESENSITIVETABLE* are created.
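
In effect, on a case-insensitive file system the second of these two statements 
maps to the same table file and fails (the column list of the first table is 
hypothetical here):
{code:sql}
create table caseSensitiveTable ( a BLOB, b BLOB );
create table CASESENSITIVETABLE ( a BLOB, b BLOB ); -- same file name once case is folded
{code}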



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6912) NPE when other drillbit is already running

2018-12-18 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724639#comment-16724639
 ] 

Boaz Ben-Zvi commented on DRILL-6912:
-

[~amansinha100] - the difference is the environment variable *DRILL_EMBEDDED*:
{code:java}
~/test/apache-drill-1.15.0/bin 99 > ./sqlline -u jdbc:drill:zk=local -n admin 
--maxWidth=10
java.lang.NullPointerException
Apache Drill 1.15.0-SNAPSHOT
"Let's Drill something more solid than concrete."
sqlline> !q
~/test/apache-drill-1.15.0/bin 100 > export DRILL_EMBEDDED=1
~/test/apache-drill-1.15.0/bin 101 > ./sqlline -u jdbc:drill:zk=local -n admin 
--maxWidth=10
Picked up JAVA_TOOL_OPTIONS: -ea
ERROR: transport error 202: bind failed: Address already in use
ERROR: JDWP Transport dt_socket failed to initialize, TRANSPORT_INIT(510)
JDWP exit error AGENT_ERROR_TRANSPORT_INIT(197): No transports initialized 
[debugInit.c:750]
~/test/apache-drill-1.15.0/bin 102 >{code}

> NPE when other drillbit is already running
> --
>
> Key: DRILL-6912
> URL: https://issues.apache.org/jira/browse/DRILL-6912
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Vitalii Diravka
>Assignee: Boaz Ben-Zvi
>Priority: Critical
>
> If a user tries to run a second drillbit process, the following output will 
> be obtained:
> {code:java}
> vitalii@vitalii-pc:/tmp/apache-drill-1.15.0$ bin/drill-embedded 
> java.lang.NullPointerException
> Apache Drill 1.15.0
> "This isn't your grandfather's SQL."
> sqlline> select * from (values(1));
> No current connection
> sqlline> !q
> {code}
> For Drill version 1.14.0 the output was correct (but too long):
> {code:java}
> ./bin/drill-embedded 
> Dec 18, 2018 7:58:47 PM org.glassfish.jersey.server.ApplicationHandler 
> initialize
> INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 
> 01:25:26...
> Error: Failure in starting embedded Drillbit: java.net.BindException: Address 
> already in use (state=,code=0)
> java.sql.SQLException: Failure in starting embedded Drillbit: 
> java.net.BindException: Address already in use
>  at 
> org.apache.drill.jdbc.impl.DrillConnectionImpl.<init>(DrillConnectionImpl.java:143)
>  at 
> org.apache.drill.jdbc.impl.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:72)
>  at 
> org.apache.drill.jdbc.impl.DrillFactory.newConnection(DrillFactory.java:68)
>  at 
> org.apache.calcite.avatica.UnregisteredDriver.connect(UnregisteredDriver.java:138)
>  at org.apache.drill.jdbc.Driver.connect(Driver.java:72)
>  at sqlline.DatabaseConnection.connect(DatabaseConnection.java:167)
>  at sqlline.DatabaseConnection.getConnection(DatabaseConnection.java:213)
>  at sqlline.Commands.connect(Commands.java:1083)
>  at sqlline.Commands.connect(Commands.java:1015)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:36)
>  at sqlline.SqlLine.dispatch(SqlLine.java:742)
>  at sqlline.SqlLine.initArgs(SqlLine.java:528)
>  at sqlline.SqlLine.begin(SqlLine.java:596)
>  at sqlline.SqlLine.start(SqlLine.java:375)
>  at sqlline.SqlLine.main(SqlLine.java:268)
> Caused by: java.net.BindException: Address already in use
>  at sun.nio.ch.Net.bind0(Native Method)
>  at sun.nio.ch.Net.bind(Net.java:433)
>  at sun.nio.ch.Net.bind(Net.java:425)
>  at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>  at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>  at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:279)
>  at 
> org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
>  at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:218)
>  at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>  at org.eclipse.jetty.server.Server.doStart(Server.java:337)
>  at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>  at org.apache.drill.exec.server.rest.WebServer.start(WebServer.java:155)
>  at org.apache.drill.exec.server.Drillbit.run(Drillbit.java:200)
>  at 
> org.apache.drill.jdbc.impl.DrillConnectionImpl.<init>(DrillConnectionImpl.java:134)
>  ... 18 more
> apache drill 1.14.0 
> "just drill it"
> 0: jdbc:drill:zk=local> !q{code}
> It looks fine to have a short message in the console about the reason for the 
> error, similar to:
> {code:java}
> java.sql.SQLException: Failure in starting embedded Drillbit: 
> java.net.BindException: Address already in use
> {code}



--
This message was sent by 

[jira] [Commented] (DRILL-6912) NPE when other drillbit is already running

2018-12-18 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724580#comment-16724580
 ] 

Boaz Ben-Zvi commented on DRILL-6912:
-

Just tested with 1.15.0 RC0 and there was no NPE:
{code:java}
~/test/apache-drill-1.15.0 8 > ./bin/drill-embedded
Picked up JAVA_TOOL_OPTIONS: -ea
ERROR: transport error 202: bind failed: Address already in use
ERROR: JDWP Transport dt_socket failed to initialize, TRANSPORT_INIT(510)
JDWP exit error AGENT_ERROR_TRANSPORT_INIT(197): No transports initialized 
[debugInit.c:750]
~/test/apache-drill-1.15.0 9 >{code}

> NPE when other drillbit is already running
> --
>
> Key: DRILL-6912
> URL: https://issues.apache.org/jira/browse/DRILL-6912
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Vitalii Diravka
>Priority: Critical
>
> If a user tries to run a second drillbit process, the following output will 
> be obtained:
> {code:java}
> vitalii@vitalii-pc:/tmp/apache-drill-1.15.0$ bin/drill-embedded 
> java.lang.NullPointerException
> Apache Drill 1.15.0
> "This isn't your grandfather's SQL."
> sqlline> select * from (values(1));
> No current connection
> sqlline> !q
> {code}
> For Drill version 1.14.0 the output was correct (but too long):
> {code:java}
> ./bin/drill-embedded 
> Dec 18, 2018 7:58:47 PM org.glassfish.jersey.server.ApplicationHandler 
> initialize
> INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 
> 01:25:26...
> Error: Failure in starting embedded Drillbit: java.net.BindException: Address 
> already in use (state=,code=0)
> java.sql.SQLException: Failure in starting embedded Drillbit: 
> java.net.BindException: Address already in use
>  at 
> org.apache.drill.jdbc.impl.DrillConnectionImpl.<init>(DrillConnectionImpl.java:143)
>  at 
> org.apache.drill.jdbc.impl.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:72)
>  at 
> org.apache.drill.jdbc.impl.DrillFactory.newConnection(DrillFactory.java:68)
>  at 
> org.apache.calcite.avatica.UnregisteredDriver.connect(UnregisteredDriver.java:138)
>  at org.apache.drill.jdbc.Driver.connect(Driver.java:72)
>  at sqlline.DatabaseConnection.connect(DatabaseConnection.java:167)
>  at sqlline.DatabaseConnection.getConnection(DatabaseConnection.java:213)
>  at sqlline.Commands.connect(Commands.java:1083)
>  at sqlline.Commands.connect(Commands.java:1015)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:36)
>  at sqlline.SqlLine.dispatch(SqlLine.java:742)
>  at sqlline.SqlLine.initArgs(SqlLine.java:528)
>  at sqlline.SqlLine.begin(SqlLine.java:596)
>  at sqlline.SqlLine.start(SqlLine.java:375)
>  at sqlline.SqlLine.main(SqlLine.java:268)
> Caused by: java.net.BindException: Address already in use
>  at sun.nio.ch.Net.bind0(Native Method)
>  at sun.nio.ch.Net.bind(Net.java:433)
>  at sun.nio.ch.Net.bind(Net.java:425)
>  at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>  at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>  at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:279)
>  at 
> org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
>  at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:218)
>  at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>  at org.eclipse.jetty.server.Server.doStart(Server.java:337)
>  at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>  at org.apache.drill.exec.server.rest.WebServer.start(WebServer.java:155)
>  at org.apache.drill.exec.server.Drillbit.run(Drillbit.java:200)
>  at 
> org.apache.drill.jdbc.impl.DrillConnectionImpl.<init>(DrillConnectionImpl.java:134)
>  ... 18 more
> apache drill 1.14.0 
> "just drill it"
> 0: jdbc:drill:zk=local> !q{code}
> It looks fine to have a short message in the console about the reason for the 
> error, similar to:
> {code:java}
> java.sql.SQLException: Failure in starting embedded Drillbit: 
> java.net.BindException: Address already in use
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6881) Hash-Table insert and probe: Compare hash values before keys

2018-12-12 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6881:

Fix Version/s: (was: 1.16.0)

> Hash-Table insert and probe: Compare hash values before keys
> 
>
> Key: DRILL-6881
> URL: https://issues.apache.org/jira/browse/DRILL-6881
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Major
>
>   When checking for the existence of a key in the hash table (during _put_ or 
> _probe_ operations), the value of that key is compared (using generated code) 
> with a potential match key (same bucket). 
>    This comparison is somewhat expensive (e.g., long keys, multi-column keys, 
> checking null conditions, NaN, etc.). Instead, if the hash values of the two 
> keys are compared first (at practically zero cost), then the costly 
> comparison can be avoided whenever the hash values don't match.
>  This code change is trivial, and given that the relevant Hash-Table code is 
> *hot code*, even minute improvements could add up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6888) Nested classes in HashAggTemplate break the plain Java for debugging codegen

2018-12-10 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6888:

Labels: ready-to-commit  (was: )

> Nested classes in HashAggTemplate break the plain Java for debugging codegen
> 
>
> Key: DRILL-6888
> URL: https://issues.apache.org/jira/browse/DRILL-6888
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Minor
>  Labels: ready-to-commit
> Attachments: janino5306141716524056052.java, 
> janino6744306210553474372.java
>
>
> The *prefer_plain_java* compile option is useful for debugging of generated 
> code.
>   DRILL-6719 ("separate spilling logic for Hash Agg") introduced two nested 
> classes into the HashAggTemplate class.  However those nested classes cause 
> the prefer_plain_java compile option to fail when compiling the generated 
> code, like:
> {code:java}
> Error: SYSTEM ERROR: CompileException: File 
> '/tmp/janino5709636998794673307.java', Line 36, Column 35: No applicable 
> constructor/method found for actual parameters 
> "org.apache.drill.exec.test.generated.HashAggregatorGen11$HashAggSpilledPartition";
>  candidates are: "protected 
> org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate$BatchHolder 
> org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate.injectMembers(org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate$BatchHolder)"
> {code}
> +The proposed fix+: Move those nested classes outside HashAggTemplate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6888) Nested classes in HashAggTemplate break the plain Java for debugging codegen

2018-12-07 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713534#comment-16713534
 ] 

Boaz Ben-Zvi commented on DRILL-6888:
-

[^janino6744306210553474372.java] After extracting the static class 
*HashAggSpilledPartition* into a separate file, the generated code still fails 
on the other nested, non-static class - *HashAggUpdater*. 
{code:java}
0: jdbc:drill:zk=local> select sum(l_quantity),l_linenumber from 
cp.`tpch/lineitem.parquet` group by l_linenumber limit 2;
Error: SYSTEM ERROR: CompileException: File 
'/tmp/janino6744306210553474372.java', Line 33, Column 35: No applicable 
constructor/method found for actual parameters 
"org.apache.drill.exec.test.generated.HashAggregatorGen0$HashAggUpdater"; 
candidates are: "protected 
org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate$BatchHolder 
org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate.injectMembers(org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate$BatchHolder)"

Fragment 0:0

Please, refer to logs for more information.

[Error Id: ed224aba-c4d2-4e2c-aac8-79b926688744 on 10.254.64.18:31020] 
(state=,code=0){code}

> Nested classes in HashAggTemplate break the plain Java for debugging codegen
> 
>
> Key: DRILL-6888
> URL: https://issues.apache.org/jira/browse/DRILL-6888
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Minor
> Attachments: janino5306141716524056052.java, 
> janino6744306210553474372.java
>
>
> The *prefer_plain_java* compile option is useful for debugging of generated 
> code.
>   DRILL-6719 ("separate spilling logic for Hash Agg") introduced two nested 
> classes into the HashAggTemplate class.  However those nested classes cause 
> the prefer_plain_java compile option to fail when compiling the generated 
> code, like:
> {code:java}
> Error: SYSTEM ERROR: CompileException: File 
> '/tmp/janino5709636998794673307.java', Line 36, Column 35: No applicable 
> constructor/method found for actual parameters 
> "org.apache.drill.exec.test.generated.HashAggregatorGen11$HashAggSpilledPartition";
>  candidates are: "protected 
> org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate$BatchHolder 
> org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate.injectMembers(org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate$BatchHolder)"
> {code}
> +The proposed fix+: Move those nested classes outside HashAggTemplate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6888) Nested classes in HashAggTemplate break the plain Java for debugging codegen

2018-12-07 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6888:

Attachment: janino6744306210553474372.java

> Nested classes in HashAggTemplate break the plain Java for debugging codegen
> 
>
> Key: DRILL-6888
> URL: https://issues.apache.org/jira/browse/DRILL-6888
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Minor
> Attachments: janino5306141716524056052.java, 
> janino6744306210553474372.java
>
>
> The *prefer_plain_java* compile option is useful for debugging of generated 
> code.
>   DRILL-6719 ("separate spilling logic for Hash Agg") introduced two nested 
> classes into the HashAggTemplate class.  However those nested classes cause 
> the prefer_plain_java compile option to fail when compiling the generated 
> code, like:
> {code:java}
> Error: SYSTEM ERROR: CompileException: File 
> '/tmp/janino5709636998794673307.java', Line 36, Column 35: No applicable 
> constructor/method found for actual parameters 
> "org.apache.drill.exec.test.generated.HashAggregatorGen11$HashAggSpilledPartition";
>  candidates are: "protected 
> org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate$BatchHolder 
> org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate.injectMembers(org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate$BatchHolder)"
> {code}
> +The proposed fix+: Move those nested classes outside HashAggTemplate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6888) Nested classes in HashAggTemplate break the plain Java for debugging codegen

2018-12-07 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713519#comment-16713519
 ] 

Boaz Ben-Zvi commented on DRILL-6888:
-

[^janino5306141716524056052.java] Attaching the generated code for the 
following simple aggregation (run on the master branch):
{code:java}
0: jdbc:drill:zk=local> select sum(l_quantity),l_linenumber from 
cp.`tpch/lineitem.parquet` group by l_linenumber limit 2;
Error: SYSTEM ERROR: CompileException: File 
'/tmp/janino5306141716524056052.java', Line 33, Column 35: No applicable 
constructor/method found for actual parameters 
"org.apache.drill.exec.test.generated.HashAggregatorGen0$HashAggSpilledPartition";
 candidates are: "protected 
org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate$BatchHolder 
org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate.injectMembers(org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate$BatchHolder)"

Fragment 0:0

Please, refer to logs for more information.

[Error Id: dd536835-1eae-47fa-812c-5da168528469 on 10.254.64.18:31020] 
(state=,code=0)
{code}

> Nested classes in HashAggTemplate break the plain Java for debugging codegen
> 
>
> Key: DRILL-6888
> URL: https://issues.apache.org/jira/browse/DRILL-6888
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Minor
> Attachments: janino5306141716524056052.java
>
>
> The *prefer_plain_java* compile option is useful for debugging of generated 
> code.
>   DRILL-6719 ("separate spilling logic for Hash Agg") introduced two nested 
> classes into the HashAggTemplate class.  However those nested classes cause 
> the prefer_plain_java compile option to fail when compiling the generated 
> code, like:
> {code:java}
> Error: SYSTEM ERROR: CompileException: File 
> '/tmp/janino5709636998794673307.java', Line 36, Column 35: No applicable 
> constructor/method found for actual parameters 
> "org.apache.drill.exec.test.generated.HashAggregatorGen11$HashAggSpilledPartition";
>  candidates are: "protected 
> org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate$BatchHolder 
> org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate.injectMembers(org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate$BatchHolder)"
> {code}
> +The proposed fix+: Move those nested classes outside HashAggTemplate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6888) Nested classes in HashAggTemplate break the plain Java for debugging codegen

2018-12-07 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6888:

Attachment: janino5306141716524056052.java

> Nested classes in HashAggTemplate break the plain Java for debugging codegen
> 
>
> Key: DRILL-6888
> URL: https://issues.apache.org/jira/browse/DRILL-6888
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Minor
> Attachments: janino5306141716524056052.java
>
>
> The *prefer_plain_java* compile option is useful for debugging of generated 
> code.
>   DRILL-6719 ("separate spilling logic for Hash Agg") introduced two nested 
> classes into the HashAggTemplate class.  However those nested classes cause 
> the prefer_plain_java compile option to fail when compiling the generated 
> code, like:
> {code:java}
> Error: SYSTEM ERROR: CompileException: File 
> '/tmp/janino5709636998794673307.java', Line 36, Column 35: No applicable 
> constructor/method found for actual parameters 
> "org.apache.drill.exec.test.generated.HashAggregatorGen11$HashAggSpilledPartition";
>  candidates are: "protected 
> org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate$BatchHolder 
> org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate.injectMembers(org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate$BatchHolder)"
> {code}
> +The proposed fix+: Move those nested classes outside HashAggTemplate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6888) Nested classes in HashAggTemplate break the plain Java for debugging codegen

2018-12-07 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-6888:
---

 Summary: Nested classes in HashAggTemplate break the plain Java 
for debugging codegen
 Key: DRILL-6888
 URL: https://issues.apache.org/jira/browse/DRILL-6888
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Relational Operators
Affects Versions: 1.14.0
Reporter: Boaz Ben-Zvi
Assignee: Boaz Ben-Zvi


The *prefer_plain_java* compile option is useful for debugging of generated 
code.

  DRILL-6719 ("separate spilling logic for Hash Agg") introduced two nested 
classes into the HashAggTemplate class.  However those nested classes cause the 
prefer_plain_java compile option to fail when compiling the generated code, 
like:
{code:java}
Error: SYSTEM ERROR: CompileException: File 
'/tmp/janino5709636998794673307.java', Line 36, Column 35: No applicable 
constructor/method found for actual parameters 
"org.apache.drill.exec.test.generated.HashAggregatorGen11$HashAggSpilledPartition";
 candidates are: "protected 
org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate$BatchHolder 
org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate.injectMembers(org.apache.drill.exec.physical.impl.aggregate.HashAggTemplate$BatchHolder)"
{code}
+The proposed fix+: Move those nested classes outside HashAggTemplate.
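
A rough sketch of the shape of that fix (field names here are hypothetical, 
just to illustrate the move):
{code:java}
// Before: HashAggSpilledPartition was a nested class inside HashAggTemplate,
// which the plain-Java codegen path cannot resolve.
// After: the same class as a top-level file of its own, referenced by the template.
public class HashAggSpilledPartition {
  private final int partitionIndex;    // hypothetical fields, for illustration
  private final String spillFileName;

  public HashAggSpilledPartition(int partitionIndex, String spillFileName) {
    this.partitionIndex = partitionIndex;
    this.spillFileName = spillFileName;
  }

  public int getPartitionIndex() { return partitionIndex; }
  public String getSpillFileName() { return spillFileName; }
}
{code}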



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6707) Query with 10-way merge join fails with IllegalArgumentException

2018-12-07 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713405#comment-16713405
 ] 

Boaz Ben-Zvi commented on DRILL-6707:
-

    Here is a hypothesis on how this bug occurred: after the DRILL-6123 change 
("Limit batch size for Merge Join"), the "fill position" in the output buffer 
is verified against the batch size (which used to be a fixed 32K). Now, in some 
situations the operator's batch size may be +updated to a lower size+ while the 
output buffer is already partially filled; in such a case, the verification 
fails. If so, eliminating the verification (in JoinStatus.java:104) may fix 
the bug.
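
A minimal sketch of that failure mode (hypothetical names; the actual check 
lives in JoinStatus.java):
{code:java}
public class FillPositionCheck {
  // The guard described above: compares the current fill position in the
  // output buffer against the operator's (now variable) batch size.
  static void verify(int fillPosition, int currentBatchSize) {
    if (fillPosition >= currentBatchSize) {
      throw new IllegalArgumentException(
          "fill position " + fillPosition + " >= batch size " + currentBatchSize);
    }
  }

  public static void main(String[] args) {
    verify(20_000, 32_768);  // fine under the original fixed 32K batch size
    verify(20_000, 16_384);  // throws: batch size was lowered mid-fill
  }
}
{code}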

  We need a way to reliably reproduce this bug - [~agirish], can you try?

> Query with 10-way merge join fails with IllegalArgumentException
> 
>
> Key: DRILL-6707
> URL: https://issues.apache.org/jira/browse/DRILL-6707
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators, Query Planning  
> Optimization
>Affects Versions: 1.15.0
>Reporter: Abhishek Girish
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Attachments: drillbit.zip
>
>
> Query
> {code}
> SELECT   *
> FROM si.tpch_sf1_parquet.customer C,
>  si.tpch_sf1_parquet.orders O,
>  si.tpch_sf1_parquet.lineitem L,
>  si.tpch_sf1_parquet.part P,
>  si.tpch_sf1_parquet.supplier S,
>  si.tpch_sf1_parquet.partsupp PS,
>  si.tpch_sf1_parquet.nation S_N,
>  si.tpch_sf1_parquet.region S_R,
>  si.tpch_sf1_parquet.nation C_N,
>  si.tpch_sf1_parquet.region C_R
> WHEREC.C_CUSTKEY = O.O_CUSTKEY 
> AND  O.O_ORDERKEY = L.L_ORDERKEY
> AND  L.L_PARTKEY = P.P_PARTKEY
> AND  L.L_SUPPKEY = S.S_SUPPKEY
> AND  P.P_PARTKEY = PS.PS_PARTKEY
> AND  P.P_SUPPKEY = PS.PS_SUPPKEY
> AND  S.S_NATIONKEY = S_N.N_NATIONKEY
> AND  S_N.N_REGIONKEY = S_R.R_REGIONKEY
> AND  C.C_NATIONKEY = C_N.N_NATIONKEY
> AND  C_N.N_REGIONKEY = C_R.R_REGIONKEY
> {code}
> Plan
> {code}
> 00-00Screen : rowType = RecordType(DYNAMIC_STAR **, DYNAMIC_STAR **0, 
> DYNAMIC_STAR **1, DYNAMIC_STAR **2, DYNAMIC_STAR **3, DYNAMIC_STAR **4, 
> DYNAMIC_STAR **5, DYNAMIC_STAR **6, DYNAMIC_STAR **7, DYNAMIC_STAR **8): 
> rowcount = 6001215.0, cumulative cost = {1.151087965E8 rows, 
> 2.66710261332395E9 cpu, 3.198503E7 io, 5.172844544E11 network, 1.87681384E9 
> memory}, id = 419943
> 00-01  ProjectAllowDup(**=[$0], **0=[$1], **1=[$2], **2=[$3], **3=[$4], 
> **4=[$5], **5=[$6], **6=[$7], **7=[$8], **8=[$9]) : rowType = 
> RecordType(DYNAMIC_STAR **, DYNAMIC_STAR **0, DYNAMIC_STAR **1, DYNAMIC_STAR 
> **2, DYNAMIC_STAR **3, DYNAMIC_STAR **4, DYNAMIC_STAR **5, DYNAMIC_STAR **6, 
> DYNAMIC_STAR **7, DYNAMIC_STAR **8): rowcount = 6001215.0, cumulative cost = 
> {1.14508675E8 rows, 2.66650249182395E9 cpu, 3.198503E7 io, 5.172844544E11 
> network, 1.87681384E9 memory}, id = 419942
> 00-02UnionExchange : rowType = RecordType(DYNAMIC_STAR T19¦¦**, 
> DYNAMIC_STAR T18¦¦**, DYNAMIC_STAR T12¦¦**, DYNAMIC_STAR T17¦¦**, 
> DYNAMIC_STAR T13¦¦**, DYNAMIC_STAR T16¦¦**, DYNAMIC_STAR T14¦¦**, 
> DYNAMIC_STAR T15¦¦**, DYNAMIC_STAR T20¦¦**, DYNAMIC_STAR T21¦¦**): rowcount = 
> 6001215.0, cumulative cost = {1.0850746E8 rows, 2.60649034182395E9 cpu, 
> 3.198503E7 io, 5.172844544E11 network, 1.87681384E9 memory}, id = 419941
> 01-01  Project(T19¦¦**=[$0], T18¦¦**=[$3], T12¦¦**=[$6], 
> T17¦¦**=[$10], T13¦¦**=[$13], T16¦¦**=[$16], T14¦¦**=[$19], T15¦¦**=[$22], 
> T20¦¦**=[$24], T21¦¦**=[$27]) : rowType = RecordType(DYNAMIC_STAR T19¦¦**, 
> DYNAMIC_STAR T18¦¦**, DYNAMIC_STAR T12¦¦**, DYNAMIC_STAR T17¦¦**, 
> DYNAMIC_STAR T13¦¦**, DYNAMIC_STAR T16¦¦**, DYNAMIC_STAR T14¦¦**, 
> DYNAMIC_STAR T15¦¦**, DYNAMIC_STAR T20¦¦**, DYNAMIC_STAR T21¦¦**): rowcount = 
> 6001215.0, cumulative cost = {1.02506245E8 rows, 2.55848062182395E9 cpu, 
> 3.198503E7 io, 2.71474688E11 network, 1.87681384E9 memory}, id = 419940
> 01-02Project(T19¦¦**=[$21], C_CUSTKEY=[$22], C_NATIONKEY=[$23], 
> T18¦¦**=[$18], O_CUSTKEY=[$19], O_ORDERKEY=[$20], T12¦¦**=[$0], 
> L_ORDERKEY=[$1], L_PARTKEY=[$2], L_SUPPKEY=[$3], T17¦¦**=[$15], 
> P_PARTKEY=[$16], P_SUPPKEY=[$17], T13¦¦**=[$4], S_SUPPKEY=[$5], 
> S_NATIONKEY=[$6], T16¦¦**=[$12], PS_PARTKEY=[$13], PS_SUPPKEY=[$14], 
> T14¦¦**=[$7], N_NATIONKEY=[$8], N_REGIONKEY=[$9], T15¦¦**=[$10], 
> R_REGIONKEY=[$11], T20¦¦**=[$24], N_NATIONKEY0=[$25], N_REGIONKEY0=[$26], 
> T21¦¦**=[$27], R_REGIONKEY0=[$28]) : rowType = RecordType(DYNAMIC_STAR 
> T19¦¦**, ANY C_CUSTKEY, ANY C_NATIONKEY, DYNAMIC_STAR T18¦¦**, ANY O_CUSTKEY, 
> ANY O_ORDERKEY, DYNAMIC_STAR T12¦¦**, ANY L_ORDERKEY, ANY L_PARTKEY, ANY 
> L_SUPPKEY, DYNAMIC_STAR T17¦¦**, ANY P_PARTKEY, ANY 

[jira] [Created] (DRILL-6881) Hash-Table insert and probe: Compare hash values before keys

2018-12-04 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-6881:
---

 Summary: Hash-Table insert and probe: Compare hash values before 
keys
 Key: DRILL-6881
 URL: https://issues.apache.org/jira/browse/DRILL-6881
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Relational Operators
Affects Versions: 1.14.0
Reporter: Boaz Ben-Zvi
Assignee: Boaz Ben-Zvi
 Fix For: 1.16.0


  When checking for the existence of a key in the hash table (during _put_ or 
_probe_ operations), the value of that key is compared (using generated code) 
with a potential match key (same bucket). 
   This comparison is somewhat expensive (e.g., long keys, multi-column keys, 
checking null conditions, NaN, etc.). Instead, if the hash values of the two 
keys are compared first (at practically zero cost), then the costly comparison 
can be avoided whenever the hash values don't match.
 This code change is trivial, and given that the relevant Hash-Table code is 
*hot code*, even minute improvements could add up.
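
A minimal sketch of the idea (hypothetical names; the real comparison is in the 
generated hash-table code):
{code:java}
public class HashFirstProbe {
  interface KeyComparison { boolean isKeyMatch(); }  // stands in for the codegen

  // Compare the stored hash values first; only run the (potentially expensive)
  // generated key comparison when the cheap hash check passes.
  static boolean matches(int probeHash, int storedHash, KeyComparison keys) {
    if (probeHash != storedHash) {
      return false;            // practically free; skips the costly key compare
    }
    return keys.isKeyMatch();  // reached only on a hash match
  }
}
{code}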



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (DRILL-6880) Hash-Join: Many null keys on the build side form a long linked chain in the Hash Table

2018-12-04 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi reassigned DRILL-6880:
---

Assignee: Boaz Ben-Zvi

> Hash-Join: Many null keys on the build side form a long linked chain in the 
> Hash Table
> --
>
> Key: DRILL-6880
> URL: https://issues.apache.org/jira/browse/DRILL-6880
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.16.0
>
>
> When building the Hash Table for the Hash-Join, each new key is matched with 
> an existing key (same bucket) by calling the generated method 
> `isKeyMatchInternalBuild`, which compares the two. However, when both keys are 
> null, the method returns *false* (meaning not-equal, i.e., a new key), so the 
> new key is added to the list after the old key. When a third null key is 
> found, it is matched against the prior two and added as well, and so on.
> This way, N null values perform on the order of N^2/2 comparisons.
> Suggested improvement: The generated code should return a third result, 
> meaning "two null keys". Then in case of Inner or Left joins all the 
> duplicate nulls can be discarded.
> Below is a simple example; note the time difference between the non-null and 
> the all-nulls tables (instrumentation also showed that for nulls, the method 
> above was called 1,249,975,000 times!):
> {code:java}
> 0: jdbc:drill:zk=local> use dfs.tmp;
> 0: jdbc:drill:zk=local> create table test as (select cast(null as int) mycol 
> from 
>  dfs.`/data/test128M.tbl` limit 5);
> 0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1 
> from 
>  dfs.`/data/test128M.tbl` limit 6);
> 0: jdbc:drill:zk=local> create table test2 as (select cast(2 as int) mycol2 
> from dfs.`/data/test128M.tbl` limit 5);
> 0: jdbc:drill:zk=local> select count(*) from test1 join test2 on test1.mycol1 
> = test2.mycol2;
> +-+
> | EXPR$0  |
> +-+
> | 0   |
> +-+
> 1 row selected (0.443 seconds)
> 0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1 
> from dfs.`/data/test128M.tbl` limit 6);
> +---++
> | Fragment  | Number of records written  |
> +---++
> | 0_0   | 6  |
> +---++
> 1 row selected (0.517 seconds)
> 0: jdbc:drill:zk=local> select count(*) from test1 join test on test1.mycol1 
> = test.mycol;
> +-+
> | EXPR$0  |
> +-+
> | 0   |
> +-+
> 1 row selected (140.098 seconds)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6880) Hash-Join: Many null keys on the build side form a long linked chain in the Hash Table

2018-12-04 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709594#comment-16709594
 ] 

Boaz Ben-Zvi commented on DRILL-6880:
-

Another "side issue" is the execution time of the generated method 
_isKeyMatchInternalBuild_ (which compares the keys, checks null conditions, 
NaN, etc). Normally that invocation time is less than a Micro-second, but on 
some occasions we've seen it go up to several Mili-Seconds (up to 117,000 uSec 
!!). 
 A big contributor to that is the Hotspot compiler, which executes in a 
"server" mode (i.e., spends more time optimizing, for a better result). When 
switched to a "client" mode (by adding the "-client" option to 
*DRILL_JAVA_OPTS*), those abnormal times diminished significantly.
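
For example (assuming the JVM options are set via conf/drill-env.sh):
{code}
# conf/drill-env.sh: run HotSpot in client mode for this experiment
export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -client"
{code}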

> Hash-Join: Many null keys on the build side form a long linked chain in the 
> Hash Table
> --
>
> Key: DRILL-6880
> URL: https://issues.apache.org/jira/browse/DRILL-6880
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.16.0
>
>
> When building the Hash Table for the Hash-Join, each new key is matched with 
> an existing key (same bucket) by calling the generated method 
> `isKeyMatchInternalBuild`, which compares the two. However, when both keys are 
> null, the method returns *false* (meaning not-equal, i.e., a new key), so the 
> new key is added to the list after the old key. When a third null key is 
> found, it is matched against the prior two and added as well, and so on.
> This way, N null values perform on the order of N^2/2 comparisons.
> Suggested improvement: The generated code should return a third result, 
> meaning "two null keys". Then in case of Inner or Left joins all the 
> duplicate nulls can be discarded.
> Below is a simple example, note the time difference between non-null and the 
> all-nulls tables (also instrumentation showed that for nulls, the method 
> above was called 1249975000 times!!)
> {code:java}
> 0: jdbc:drill:zk=local> use dfs.tmp;
> 0: jdbc:drill:zk=local> create table test as (select cast(null as int) mycol 
> from 
>  dfs.`/data/test128M.tbl` limit 5);
> 0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1 
> from 
>  dfs.`/data/test128M.tbl` limit 6);
> 0: jdbc:drill:zk=local> create table test2 as (select cast(2 as int) mycol2 
> from dfs.`/data/test128M.tbl` limit 5);
> 0: jdbc:drill:zk=local> select count(*) from test1 join test2 on test1.mycol1 
> = test2.mycol2;
> +-+
> | EXPR$0  |
> +-+
> | 0   |
> +-+
> 1 row selected (0.443 seconds)
> 0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1 
> from dfs.`/data/test128M.tbl` limit 6);
> +---++
> | Fragment  | Number of records written  |
> +---++
> | 0_0   | 6  |
> +---++
> 1 row selected (0.517 seconds)
> 0: jdbc:drill:zk=local> select count(*) from test1 join test on test1.mycol1 
> = test.mycol;
> +-+
> | EXPR$0  |
> +-+
> | 0   |
> +-+
> 1 row selected (140.098 seconds)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6880) Hash-Join: Many null keys on the build side form a long linked chain in the Hash Table

2018-12-04 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-6880:
---

 Summary: Hash-Join: Many null keys on the build side form a long 
linked chain in the Hash Table
 Key: DRILL-6880
 URL: https://issues.apache.org/jira/browse/DRILL-6880
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Relational Operators
Affects Versions: 1.14.0
Reporter: Boaz Ben-Zvi
 Fix For: 1.16.0


When building the Hash Table for the Hash-Join, each new key is matched with an 
existing key (same bucket) by calling the generated method 
`isKeyMatchInternalBuild`, which compares the two. However, when both keys are 
null, the method returns *false* (meaning not-equal, i.e., it is a new key), 
so the new key is added into the list following the old key. When a third 
null key is found, it is matched against the prior two and added as well, 
and so on.

In this way, N null build keys perform on the order of N^2/2 comparisons.

Suggested improvement: the generated code should return a third result, meaning 
"two null keys". Then, in the case of Inner or Left joins, all the duplicate 
nulls can be discarded.
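
A minimal sketch of the three-state idea (names are hypothetical; the real 
comparison is produced by Drill's code generator):
{code:java}
/** Illustrative three-state comparison result - not Drill's generated code. */
enum KeyMatch { MATCH, NO_MATCH, BOTH_NULL }

class NullAwareMatch {
  static KeyMatch compareKeys(Integer existing, Integer incoming) {
    if (existing == null && incoming == null) {
      return KeyMatch.BOTH_NULL;  // today this case is reported as NO_MATCH
    }
    if (existing == null || incoming == null) {
      return KeyMatch.NO_MATCH;
    }
    return existing.equals(incoming) ? KeyMatch.MATCH : KeyMatch.NO_MATCH;
  }
}
// For an Inner or Left join, a BOTH_NULL result on the build side would let
// the incoming null key be discarded instead of appended to the chain,
// reducing the N^2/2 null-vs-null comparisons to N.
{code}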

Below is a simple example; note the time difference between the non-null and 
the all-nulls tables (instrumentation also showed that for nulls, the method 
`isKeyMatchInternalBuild` was called 1,249,975,000 times!)
{code:java}
0: jdbc:drill:zk=local> use dfs.tmp;
0: jdbc:drill:zk=local> create table test as (select cast(null as int) mycol 
from 
 dfs.`/data/test128M.tbl` limit 5);
0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1 
from 
 dfs.`/data/test128M.tbl` limit 6);
0: jdbc:drill:zk=local> create table test2 as (select cast(2 as int) mycol2 
from dfs.`/data/test128M.tbl` limit 5);
0: jdbc:drill:zk=local> select count(*) from test1 join test2 on test1.mycol1 = 
test2.mycol2;
+-+
| EXPR$0  |
+-+
| 0   |
+-+
1 row selected (0.443 seconds)
0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1 
from dfs.`/data/test128M.tbl` limit 6);
+---++
| Fragment  | Number of records written  |
+---++
| 0_0   | 6  |
+---++
1 row selected (0.517 seconds)
0: jdbc:drill:zk=local> select count(*) from test1 join test on test1.mycol1 = 
test.mycol;
+-+
| EXPR$0  |
+-+
| 0   |
+-+
1 row selected (140.098 seconds)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6767) Simplify transfer of information from the planner to the operators

2018-11-26 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6767:

Description: 
Currently, little of the specific information known to the planner is passed 
to the operators. For example, see the `joinType` parameter passed to the Join 
operators (specifying whether this is a LEFT, RIGHT, INNER, or FULL join). 
 The relevant code passes this information explicitly via the constructors' 
signatures (e.g., see HashJoinPOP, AbstractJoinPop, etc.), stores it in 
dedicated fields, and forces matching changes in all the test code that uses it.
 In the near future, many more such "pieces of information" may be added to 
Drill, including:
 (1) Is this a Semi (or Anti-Semi) join.
 (2) `joinControl`
 (3) `isRowKeyJoin`
 (4) `isBroadcastJoin`
 (5) Which join columns are not needed (DRILL-6758)
 (6) Is this operator positioned between Lateral and UnNest.
 (7) For Hash-Agg: Which phase (already implemented).
 (8) For Hash-Agg: Perform COUNT (DRILL-6836)

Each addition of such information requires a significant code change and adds 
some code clutter.

*Suggestion*: Instead, pass a single object containing all the needed planner 
information. Then the next time another field is added, only that object needs 
to change. (Ideally, the whole plan could be passed, and each operator could 
pick out the fields it needs.)

  was:
Currently, little of the specific information known to the planner is passed 
to the operators. For example, see the `joinType` parameter passed to the Join 
operators (specifying whether this is a LEFT, RIGHT, INNER, or FULL join). 
 The relevant code passes this information explicitly via the constructors' 
signatures (e.g., see HashJoinPOP, AbstractJoinPop, etc.), stores it in 
dedicated fields, and forces matching changes in all the test code that uses it.
 In the near future, many more such "pieces of information" may be added to 
Drill, including:
 (1) Is this a Semi (or Anti-Semi) join.
 (2) `joinControl`
 (3) `isRowKeyJoin`
 (4) `isBroadcastJoin`
 (5) Which join columns are not needed (DRILL-6758)
 (6) Is this operator positioned between Lateral and UnNest.
 (7) For Hash-Agg: Which phase (already implemented).

Each addition of such information requires a significant code change and adds 
some code clutter.

*Suggestion*: Instead, pass a single object containing all the needed planner 
information. Then the next time another field is added, only that object needs 
to change. (Ideally, the whole plan could be passed, and each operator could 
pick out the fields it needs.)




> Simplify transfer of information from the planner to the operators
> --
>
> Key: DRILL-6767
> URL: https://issues.apache.org/jira/browse/DRILL-6767
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators, Query Planning & 
> Optimization
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Minor
> Fix For: 1.16.0
>
>
> Currently, little of the specific information known to the planner is passed 
> to the operators. For example, see the `joinType` parameter passed to the 
> Join operators (specifying whether this is a LEFT, RIGHT, INNER, or FULL 
> join). 
>  The relevant code passes this information explicitly via the constructors' 
> signatures (e.g., see HashJoinPOP, AbstractJoinPop, etc.), stores it in 
> dedicated fields, and forces matching changes in all the test code that uses 
> it.
>  In the near future, many more such "pieces of information" may be added to 
> Drill, including:
>  (1) Is this a Semi (or Anti-Semi) join.
>  (2) `joinControl`
>  (3) `isRowKeyJoin`
>  (4) `isBroadcastJoin`
>  (5) Which join columns are not needed (DRILL-6758)
>  (6) Is this operator positioned between Lateral and UnNest.
>  (7) For Hash-Agg: Which phase (already implemented).
>  (8) For Hash-Agg: Perform COUNT (DRILL-6836)
> Each addition of such information requires a significant code change and 
> adds some code clutter.
> *Suggestion*: Instead, pass a single object containing all the needed 
> planner information. Then the next time another field is added, only that 
> object needs to change. (Ideally, the whole plan could be passed, and each 
> operator could pick out the fields it needs.)
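
A minimal sketch of the suggested carrier object (hypothetical name and 
fields; the real set would come from the list above):
{code:java}
/** Hypothetical immutable carrier of planner decisions, passed through each
 *  operator's constructor instead of a growing list of parameters. */
class PlannerInfoSketch {
  private final boolean semiJoin;
  private final boolean rowKeyJoin;
  private final boolean broadcastJoin;

  PlannerInfoSketch(boolean semiJoin, boolean rowKeyJoin, boolean broadcastJoin) {
    this.semiJoin = semiJoin;
    this.rowKeyJoin = rowKeyJoin;
    this.broadcastJoin = broadcastJoin;
  }

  boolean isSemiJoin() { return semiJoin; }
  boolean isRowKeyJoin() { return rowKeyJoin; }
  boolean isBroadcastJoin() { return broadcastJoin; }
}
// With a constructor like HashJoinPOP(child, conditions, plannerInfo), adding
// another flag later would touch only this class, not every operator
// signature and every test that builds one.
{code}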



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6828) Hit UnrecognizedPropertyException when run tpch queries

2018-11-26 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699600#comment-16699600
 ] 

Boaz Ben-Zvi commented on DRILL-6828:
-

Can this error be reproduced? The code change looks correct, and the JSON 
property "outgoingBatchSize" should deserialize fine. Why has this error not 
shown up anywhere else since Nov. 2nd?

Maybe there was some problem with that particular build (from Nov. 2)?
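
For reference, a minimal sketch (hypothetical class and field names, not the 
actual HashPartitionSender source) of the Jackson constructor mapping such a 
property uses; an UnrecognizedPropertyException like the one below typically 
means the serializing and deserializing sides were built from different 
versions of the class:
{code:java}
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;

public class SenderSketch {
  private final long outgoingBatchSize;

  @JsonCreator
  public SenderSketch(@JsonProperty("outgoingBatchSize") long outgoingBatchSize) {
    this.outgoingBatchSize = outgoingBatchSize;
  }

  public long getOutgoingBatchSize() { return outgoingBatchSize; }

  public static void main(String[] args) throws Exception {
    // Round-trips fine when both sides know the field; a drillbit running an
    // older class without the property would fail with
    // UnrecognizedPropertyException during deserialization.
    SenderSketch s = new ObjectMapper()
        .readValue("{\"outgoingBatchSize\": 16384}", SenderSketch.class);
    System.out.println(s.getOutgoingBatchSize());
  }
}
{code}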

 

> Hit UnrecognizedPropertyException when run tpch queries
> ---
>
> Key: DRILL-6828
> URL: https://issues.apache.org/jira/browse/DRILL-6828
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.15.0
> Environment: RHEL 7,   Apache Drill commit id: 
> 18e09a1b1c801f2691a05ae7db543bf71874cfea
>Reporter: Dechang Gu
>Assignee: Boaz Ben-Zvi
>Priority: Blocker
> Fix For: 1.15.0
>
>
> Installed Apache Drill 1.15.0 commit id: 
> 18e09a1b1c801f2691a05ae7db543bf71874cfea DRILL-6763: Codegen optimization of 
> SQL functions with constant values(\#1481)
> Hit the following errors:
> {code}
> java.sql.SQLException: SYSTEM ERROR: UnrecognizedPropertyException: 
> Unrecognized field "outgoingBatchSize" (class 
> org.apache.drill.exec.physical.config.HashPartitionSender), not marked as 
> ignorable (9 known properties: "receiver-major-fragment", 
> "initialAllocation", "expr", "userName", "@id", "child", "cost", 
> "destinations", "maxAllocation"])
>  at [Source: (StringReader); line: 1000, column: 29] (through reference 
> chain: 
> org.apache.drill.exec.physical.config.HashPartitionSender["outgoingBatchSize"])
> Fragment 3:175
> Please, refer to logs for more information.
> [Error Id: cc023cdb-9a46-4edd-ad0b-6da1e9085291 on ucs-node6.perf.lab:31010]
> at 
> org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(DrillCursor.java:528)
> at 
> org.apache.drill.jdbc.impl.DrillCursor.loadInitialSchema(DrillCursor.java:600)
> at 
> org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:1288)
> at 
> org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:61)
> at 
> org.apache.calcite.avatica.AvaticaConnection$1.execute(AvaticaConnection.java:667)
> at 
> org.apache.drill.jdbc.impl.DrillMetaImpl.prepareAndExecute(DrillMetaImpl.java:1109)
> at 
> org.apache.drill.jdbc.impl.DrillMetaImpl.prepareAndExecute(DrillMetaImpl.java:1120)
> at 
> org.apache.calcite.avatica.AvaticaConnection.prepareAndExecuteInternal(AvaticaConnection.java:675)
> at 
> org.apache.drill.jdbc.impl.DrillConnectionImpl.prepareAndExecuteInternal(DrillConnectionImpl.java:196)
> at 
> org.apache.calcite.avatica.AvaticaStatement.executeInternal(AvaticaStatement.java:156)
> at 
> org.apache.calcite.avatica.AvaticaStatement.executeQuery(AvaticaStatement.java:227)
> at PipSQueak.executeQuery(PipSQueak.java:289)
> at PipSQueak.runTest(PipSQueak.java:104)
> at PipSQueak.main(PipSQueak.java:477)
> Caused by: org.apache.drill.common.exceptions.UserRemoteException: SYSTEM 
> ERROR: UnrecognizedPropertyException: Unrecognized field "outgoingBatchSize" 
> (class org.apache.drill.exec.physical.config.HashPartitionSender), not marked 
> as ignorable (9 known properties: "receiver-major-fragment", 
> "initialAllocation", "expr", "userName", "@id", "child", "cost", 
> "destinations", "maxAllocation"])
>  at [Source: (StringReader); line: 1000, column: 29] (through reference 
> chain: 
> org.apache.drill.exec.physical.config.HashPartitionSender["outgoingBatchSize"])
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6864) Root POM: Update the git-commit-id plugin

2018-11-21 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695379#comment-16695379
 ] 

Boaz Ben-Zvi commented on DRILL-6864:
-

Also, the newer versions (since 2.2.3) allow an *mvn* command-line option to 
skip this plugin (e.g., when it is not needed for development builds): 
{code}-Dmaven.gitcommitid.skip=true{code}
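
For example, a typical developer build could pass (the skip flag as above; the 
other flags are just illustrative):
{code}
mvn clean install -DskipTests -Dmaven.gitcommitid.skip=true
{code}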


 

> Root POM: Update the git-commit-id plugin
> -
>
> Key: DRILL-6864
> URL: https://issues.apache.org/jira/browse/DRILL-6864
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Tools, Build & Test
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Assignee: Boaz Ben-Zvi
>Priority: Minor
> Fix For: 1.15.0
>
>
>    The Maven git-commit-id plugin is at version 2.1.9, which is about 4.5 
> years old. Executing this plugin takes a significant portion of the mvn 
> build time, and newer versions run more than twice as fast (see below).
>   Suggestion: upgrade to the latest (2.2.5) to shorten the Drill mvn build 
> time.
> Here are the run times with our *current (2.1.9)* version:
> {code:java}
> [INFO]   git-commit-id-plugin:revision (for-jars) . [25.320s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [24.255s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [22.821s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [32.889s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [34.557s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [26.085s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [46.135s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [72.811s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [45.956s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [18.223s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [19.841s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [50.146s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [30.993s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [32.839s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [33.852s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [23.562s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [25.333s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [24.737s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [19.098s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [46.245s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [40.350s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [34.610s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [78.756s]
> [INFO]   git-commit-id-plugin:revision (for-source-tarball) ... [52.551s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [10.940s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [24.573s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [24.404s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [43.501s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [25.041s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [39.149s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [40.310s]
> {code}
> And here are the run times with a newer (2.2.4) version:
> {code:java}
> [INFO]   git-commit-id-plugin:revision (for-jars) . [6.964s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [18.732s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [7.441s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [8.146s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [6.404s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [7.837s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [9.788s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [9.136s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [19.607s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [9.289s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [8.046s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [8.268s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [7.868s]
> [INFO]   git-commit-id-plugin:revision (for-jars) . [10.750s]
> [INFO]   

[jira] [Created] (DRILL-6864) Root POM: Update the git-commit-id plugin

2018-11-20 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-6864:
---

 Summary: Root POM: Update the git-commit-id plugin
 Key: DRILL-6864
 URL: https://issues.apache.org/jira/browse/DRILL-6864
 Project: Apache Drill
  Issue Type: Improvement
  Components: Tools, Build & Test
Affects Versions: 1.14.0
Reporter: Boaz Ben-Zvi
Assignee: Boaz Ben-Zvi
 Fix For: 1.15.0


   The Maven git-commit-id plugin is at version 2.1.9, which is about 4.5 
years old. Executing this plugin takes a significant portion of the mvn build 
time, and newer versions run more than twice as fast (see below).

  Suggestion: upgrade to the latest (2.2.5) to shorten the Drill mvn build 
time.
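
The upgrade itself should be just a version bump in the root pom.xml (a 
sketch; plugin coordinates as published on Maven Central):
{code:xml}
<plugin>
  <groupId>pl.project13.maven</groupId>
  <artifactId>git-commit-id-plugin</artifactId>
  <version>2.2.5</version>
</plugin>
{code}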

Here are the run times with our *current (2.1.9)* version:
{code:java}
[INFO]   git-commit-id-plugin:revision (for-jars) . [25.320s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [24.255s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [22.821s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [32.889s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [34.557s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [26.085s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [46.135s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [72.811s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [45.956s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [18.223s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [19.841s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [50.146s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [30.993s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [32.839s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [33.852s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [23.562s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [25.333s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [24.737s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [19.098s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [46.245s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [40.350s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [34.610s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [78.756s]
[INFO]   git-commit-id-plugin:revision (for-source-tarball) ... [52.551s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [10.940s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [24.573s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [24.404s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [43.501s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [25.041s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [39.149s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [40.310s]
{code}
And here are the run times with a newer (2.2.4) version:
{code:java}
[INFO]   git-commit-id-plugin:revision (for-jars) . [6.964s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [18.732s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [7.441s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [8.146s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [6.404s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [7.837s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [9.788s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [9.136s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [19.607s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [9.289s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [8.046s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [8.268s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [7.868s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [10.750s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [8.558s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [11.267s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [15.696s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [9.446s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [6.187s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [24.806s]
[INFO]   git-commit-id-plugin:revision (for-jars) . [14.591s]
[INFO]   

[jira] [Commented] (DRILL-6863) Drop table is not working on amazon S3

2018-11-20 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693882#comment-16693882
 ] 

Boaz Ben-Zvi commented on DRILL-6863:
-

Could this be the same issue as DRILL-4896 ("After a failed CTAS, the table 
both exists and does not exist")?

 

> Drop table is not working on amazon S3
> --
>
> Key: DRILL-6863
> URL: https://issues.apache.org/jira/browse/DRILL-6863
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Denys Ordynskiy
>Priority: Major
>
> Parquet table was created using CTAS on S3.
> Request "drop table 
> s3.tmp.`/drill/transitive_closure/DRILL_6173_filterPushdown/tab1`"
> returns successfully response:
> "Table [/drill/transitive_closure/DRILL_6173_filterPushdown/tab1] dropped"
>  
> *Actual result:*
> Drill did not delete the table `tab1`, files still present on S3 storage.
> In "drillbit.out":
> {code:java}
> 23:54:49.661 [2416095b-6544-fc80-0dfa-2fc19c4dee0e:foreman] ERROR 
> o.apache.hadoop.fs.s3a.S3AFileSystem - rename: src not found 
> /drill/transitive_closure/DRILL_6173_filterPushdown/tab1
> {code}
>  
> *Expected result:*
> Drill should delete table from S3 storage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6861) Hash-Join: Spilled partitions are skipped following an empty probe side

2018-11-19 Thread Boaz Ben-Zvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-6861:

Description: 
     Following DRILL-6755 (_Avoid building a hash table when the probe side is 
empty_), the special case of an empty spilled probe partition was not handled. 
 When such a case occurs, the Hash-Join terminates early (returns NONE) and 
the remaining partitions are not processed or returned, which may lead to 
incorrect results.

  A test case - force tpcds/query95 to spill (sf1):
{code:java}
0: jdbc:drill:zk=local> alter system set `exec.hashjoin.max_batches_in_memory` 
= 40;
+---+---+
|  ok   |summary|
+---+---+
| true  | exec.hashjoin.max_batches_in_memory updated.  |
+---+---+
1 row selected (1.325 seconds)
0: jdbc:drill:zk=local> WITH ws_wh AS
. . . . . . . . . . . > (
. . . . . . . . . . . >SELECT ws1.ws_order_number,
. . . . . . . . . . . >   ws1.ws_warehouse_sk wh1,
. . . . . . . . . . . >   ws2.ws_warehouse_sk wh2
. . . . . . . . . . . >FROM   dfs.`/data/tpcds/sf1/parquet/web_sales` 
ws1,
. . . . . . . . . . . >   dfs.`/data/tpcds/sf1/parquet/web_sales` 
ws2
. . . . . . . . . . . >WHERE  ws1.ws_order_number = ws2.ws_order_number
. . . . . . . . . . . >ANDws1.ws_warehouse_sk <> 
ws2.ws_warehouse_sk)
. . . . . . . . . . . > SELECT
. . . . . . . . . . . >  Count(DISTINCT ws1.ws_order_number) AS `order 
count` ,
. . . . . . . . . . . >  Sum(ws1.ws_ext_ship_cost)   AS `total 
shipping cost` ,
. . . . . . . . . . . >  Sum(ws1.ws_net_profit)  AS `total 
net profit`
. . . . . . . . . . . > FROM dfs.`/data/tpcds/sf1/parquet/web_sales` ws1 ,
. . . . . . . . . . . >  dfs.`/data/tpcds/sf1/parquet/date_dim` dd,
. . . . . . . . . . . >  dfs.`/data/tpcds/sf1/parquet/customer_address` 
ca,
. . . . . . . . . . . >  dfs.`/data/tpcds/sf1/parquet/web_site` wbst
. . . . . . . . . . . > WHEREdd.d_date BETWEEN '2000-04-01' AND  (
. . . . . . . . . . . >   Cast('2000-04-01' AS DATE) + INTERVAL 
'60' day)
. . . . . . . . . . . > AND  ws1.ws_ship_date_sk = dd.d_date_sk
. . . . . . . . . . . > AND  ws1.ws_ship_addr_sk = ca.ca_address_sk
. . . . . . . . . . . > AND  ca.ca_state = 'IN'
. . . . . . . . . . . > AND  ws1.ws_web_site_sk = wbst.web_site_sk
. . . . . . . . . . . > AND  wbst.web_company_name = 'pri'
. . . . . . . . . . . > AND  ws1.ws_order_number IN
. . . . . . . . . . . >  (
. . . . . . . . . . . > SELECT ws_wh.ws_order_number
. . . . . . . . . . . > FROM   ws_wh)
. . . . . . . . . . . > AND  ws1.ws_order_number IN
. . . . . . . . . . . >  (
. . . . . . . . . . . > SELECT wr.wr_order_number
. . . . . . . . . . . > FROM   
dfs.`/data/tpcds/sf1/parquet/web_returns` wr,
. . . . . . . . . . . >ws_wh
. . . . . . . . . . . > WHERE  wr.wr_order_number = 
ws_wh.ws_order_number)
. . . . . . . . . . . > ORDER BY count(DISTINCT ws1.ws_order_number)
. . . . . . . . . . . > LIMIT 100;
+--+--+-+
| order count  | total shipping cost  |  total net profit   |
+--+--+-+
| 17   | 38508.1305   | 20822.3   |
+--+--+-+
1 row selected (105.621 seconds)
{code}
The correct results should be:
{code:java}
+--+--+-+
| order count  | total shipping cost  |  total net profit   |
+--+--+-+
| 34   | 63754.72 | 15919.0098  |
+--+--+-+
{code}

  was:
     Following DRILL-6755 (_Avoid building a hash table when the probe side is 
empty_), the special case of an empty spilled probe partition was not handled. 
 When such a case occurs, the Hash-Join terminates early (returns NONE) and 
the remaining partitions are not processed or returned, which may lead to 
incorrect results.

  A test case - force tpcds/query95 to spill:
{code:java}
0: jdbc:drill:zk=local> alter system set `exec.hashjoin.max_batches_in_memory` 
= 40;
+---+---+
|  ok   |summary|
+---+---+
| true  | exec.hashjoin.max_batches_in_memory updated.  |
+---+---+
1 row selected (1.325 seconds)
0: jdbc:drill:zk=local> WITH ws_wh AS
. . . . . . . . . . 

[jira] [Created] (DRILL-6861) Hash-Join: Spilled partitions are skipped following an empty probe side

2018-11-19 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-6861:
---

 Summary: Hash-Join: Spilled partitions are skipped following an 
empty probe side
 Key: DRILL-6861
 URL: https://issues.apache.org/jira/browse/DRILL-6861
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Relational Operators
Affects Versions: 1.14.0
Reporter: Boaz Ben-Zvi
Assignee: Boaz Ben-Zvi
 Fix For: 1.15.0


     Following DRILL-6755 (_Avoid building a hash table when the probe side is 
empty_), the special case of an empty spilled probe partition was not handled. 
 When such a case occurs, the Hash-Join terminates early (returns NONE) and 
the remaining partitions are not processed or returned, which may lead to 
incorrect results.
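
A hedged sketch of the control flow the fix implies (illustrative names only; 
the real logic lives in the Hash-Join's spill handling):
{code:java}
import java.util.Deque;

/** Illustrative only - not Drill's actual HashJoinBatch code. */
class SpilledPartitionLoop {
  interface Partition {
    boolean probeSideIsEmpty();
    void process();
    void release();
  }

  static void processAll(Deque<Partition> spilled) {
    while (!spilled.isEmpty()) {
      Partition p = spilled.pop();
      if (p.probeSideIsEmpty()) {
        p.release(); // no probe rows can match; free the build side
        continue;    // the bug: terminating with NONE here skipped the rest
      }
      p.process();
    }
  }
}
{code}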

  A test case - force tpcds/query95 to spill:
{code:java}
0: jdbc:drill:zk=local> alter system set `exec.hashjoin.max_batches_in_memory` 
= 40;
+---+---+
|  ok   |summary|
+---+---+
| true  | exec.hashjoin.max_batches_in_memory updated.  |
+---+---+
1 row selected (1.325 seconds)
0: jdbc:drill:zk=local> WITH ws_wh AS
. . . . . . . . . . . > (
. . . . . . . . . . . >SELECT ws1.ws_order_number,
. . . . . . . . . . . >   ws1.ws_warehouse_sk wh1,
. . . . . . . . . . . >   ws2.ws_warehouse_sk wh2
. . . . . . . . . . . >FROM   dfs.`/data/tpcds/sf1/parquet/web_sales` 
ws1,
. . . . . . . . . . . >   dfs.`/data/tpcds/sf1/parquet/web_sales` 
ws2
. . . . . . . . . . . >WHERE  ws1.ws_order_number = ws2.ws_order_number
. . . . . . . . . . . >ANDws1.ws_warehouse_sk <> 
ws2.ws_warehouse_sk)
. . . . . . . . . . . > SELECT
. . . . . . . . . . . >  Count(DISTINCT ws1.ws_order_number) AS `order 
count` ,
. . . . . . . . . . . >  Sum(ws1.ws_ext_ship_cost)   AS `total 
shipping cost` ,
. . . . . . . . . . . >  Sum(ws1.ws_net_profit)  AS `total 
net profit`
. . . . . . . . . . . > FROM dfs.`/data/tpcds/sf1/parquet/web_sales` ws1 ,
. . . . . . . . . . . >  dfs.`/data/tpcds/sf1/parquet/date_dim` dd,
. . . . . . . . . . . >  dfs.`/data/tpcds/sf1/parquet/customer_address` 
ca,
. . . . . . . . . . . >  dfs.`/data/tpcds/sf1/parquet/web_site` wbst
. . . . . . . . . . . > WHEREdd.d_date BETWEEN '2000-04-01' AND  (
. . . . . . . . . . . >   Cast('2000-04-01' AS DATE) + INTERVAL 
'60' day)
. . . . . . . . . . . > AND  ws1.ws_ship_date_sk = dd.d_date_sk
. . . . . . . . . . . > AND  ws1.ws_ship_addr_sk = ca.ca_address_sk
. . . . . . . . . . . > AND  ca.ca_state = 'IN'
. . . . . . . . . . . > AND  ws1.ws_web_site_sk = wbst.web_site_sk
. . . . . . . . . . . > AND  wbst.web_company_name = 'pri'
. . . . . . . . . . . > AND  ws1.ws_order_number IN
. . . . . . . . . . . >  (
. . . . . . . . . . . > SELECT ws_wh.ws_order_number
. . . . . . . . . . . > FROM   ws_wh)
. . . . . . . . . . . > AND  ws1.ws_order_number IN
. . . . . . . . . . . >  (
. . . . . . . . . . . > SELECT wr.wr_order_number
. . . . . . . . . . . > FROM   
dfs.`/data/tpcds/sf1/parquet/web_returns` wr,
. . . . . . . . . . . >ws_wh
. . . . . . . . . . . > WHERE  wr.wr_order_number = 
ws_wh.ws_order_number)
. . . . . . . . . . . > ORDER BY count(DISTINCT ws1.ws_order_number)
. . . . . . . . . . . > LIMIT 100;
+--+--+-+
| order count  | total shipping cost  |  total net profit   |
+--+--+-+
| 17   | 38508.1305   | 20822.3   |
+--+--+-+
1 row selected (105.621 seconds)
{code}
The correct results should be:
{code:java}
+--+--+-+
| order count  | total shipping cost  |  total net profit   |
+--+--+-+
| 34   | 63754.72 | 15919.0098  |
+--+--+-+
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6859) BETWEEN dates with a slightly malformed DATE string returns false

2018-11-15 Thread Boaz Ben-Zvi (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688859#comment-16688859
 ] 

Boaz Ben-Zvi commented on DRILL-6859:
-

[~arina] - can you take a look?
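
For anyone hitting this before a fix: casting the lower bound explicitly 
should sidestep the bad string comparison, since the cast already handles the 
malformed month in the queries below (a workaround sketch, not a fix):
{code:java}
select count(*) from `date_dim` dd
where dd.d_date BETWEEN Cast('2000-4-01' AS DATE)
                    and ( Cast('2000-4-01' AS DATE) + INTERVAL '60' day );
{code}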

 

> BETWEEN dates with a slightly malformed DATE string returns false
> -
>
> Key: DRILL-6859
> URL: https://issues.apache.org/jira/browse/DRILL-6859
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Affects Versions: 1.14.0
>Reporter: Boaz Ben-Zvi
>Priority: Major
> Fix For: Future
>
>
> (This may be a Calcite issue.)
> In the following query using BETWEEN with dates, the month is specified as 
> "4" instead of "04", which causes the BETWEEN clause to evaluate to FALSE. 
> Note that rewriting the clause with less-than etc. does work correctly.
> {code:java}
> 0: jdbc:drill:zk=local> select count(*) from `date_dim` dd where dd.d_date 
> BETWEEN '2000-4-01' and ( Cast('2000-4-01' AS DATE) + INTERVAL '60' day) ;
> +-+
> | EXPR$0  |
> +-+
> | 0   |
> +-+
> 1 row selected (0.184 seconds)
> 0: jdbc:drill:zk=local> select count(*) from `date_dim` dd where dd.d_date 
> BETWEEN '2000-04-01' and ( Cast('2000-4-01' AS DATE) + INTERVAL '60' day) 
> limit 10;
> +-+
> | EXPR$0  |
> +-+
> | 61  |
> +-+
> 1 row selected (0.209 seconds)
> 0: jdbc:drill:zk=local> select count(*) from `date_dim` dd where dd.d_date >= 
> '2000-4-01' and dd.d_date <= '2000-5-31';
> +-+
> | EXPR$0  |
> +-+
> | 61  |
> +-+
> 1 row selected (0.227 seconds)
> {code}
> The physical plan for the second (good) case implements the BETWEEN clause 
> with a FILTER on top of the scanner. For the first (failed) case, there is a 
> "limit 0" on top of the scanner.
> (This query was extracted from TPC-DS 95, used over Parquet files).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6860) SqlLine: EXPLAIN produces very long header lines

2018-11-15 Thread Boaz Ben-Zvi (JIRA)
Boaz Ben-Zvi created DRILL-6860:
---

 Summary: SqlLine: EXPLAIN produces very long header lines
 Key: DRILL-6860
 URL: https://issues.apache.org/jira/browse/DRILL-6860
 Project: Apache Drill
  Issue Type: Bug
  Components: Client - CLI
Affects Versions: 1.14.0
Reporter: Boaz Ben-Zvi
Assignee: Arina Ielchiieva
 Fix For: 1.15.0


Possibly a result of upgrading to SqlLine 1.5.0 (DRILL-3853, PR #1462), the 
header dividing lines displayed when using EXPLAIN have become very long:

{code}
0: jdbc:drill:zk=local> explain plan for select count(*) from 
dfs.`/data/tpcds/sf1/parquet/date_dim`;
+-+---+
|   
 text   
  | 





json





  |
+-+---+
| 00-00Screen
00-01  Project(EXPR$0=[$0])
00-02DirectScan(groupscan=[files = 
[/data/tpcds/sf1/parquet/date_dim/0_0_0.parquet], numFiles = 1, 
DynamicPojoRecordReader{records = [[73049]]}])
  | {
  "head" : {
"version" : 1,
"generator" : {
  "type" : "ExplainHandler",
  "info" : ""
},
"type" : "APACHE_DRILL_PHYSICAL",
"options" : [ {
  "kind" : "BOOLEAN",
  "accessibleScopes" : "ALL",
  "name" : "planner.enable_nljoin_for_scalar_only",
  "bool_val" : true,
  "scope" : "SESSION"
} ],
"queue" : 0,
"hasResourcePlan" : false,
"resultMode" : "EXEC"
  },
  "graph" : [ {
"pop" : "metadata-direct-scan",
"@id" : 2,
"cost" : 1.0
  }, {
"pop" : "project",
"@id" : 1,
"exprs" : [ {
  "ref" : "`EXPR$0`",
  "expr" : "`count0$EXPR$0`"
  
