[jira] [Commented] (DRILL-6606) Hash Join returns incorrect data types when joining subqueries with limit 0

ASF GitHub Bot (JIRA) Thu, 19 Jul 2018 15:23:24 -0700


    [ https://issues.apache.org/jira/browse/DRILL-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549949#comment-16549949 ]


ASF GitHub Bot commented on DRILL-6606:
---------------------------------------

ilooner commented on issue #1384: DRILL-6606: Fixed bug in HashJoin that caused 
it not to return OK_NEW_SCHEMA in some cases.
URL: https://github.com/apache/drill/pull/1384#issuecomment-406432426
 
 
   Thanks for the +1.
   
   With respect to your comment, calling prefetchFirstBatchFromBothSides from 
buildSchema was actually the source of the problem. Doing so left the operator 
in BatchState.FIRST after buildSchema returned, which caused an 
**OK_NEW_SCHEMA** to NOT be sent. This in turn caused downstream operators to 
never build a correct schema and to return incorrect data types in some cases. 
That was the crux of the issue.
   
   This change fixes that issue by separating data prefetching into two phases:
   
     - Schema sniffing
     - Data sniffing
   
   The schemas need to be sniffed in the buildSchema call so we can have the 
schema. After sniffing schemas, the state of the operator is BUILD_SCHEMA and 
OK_NEW_SCHEMA is emitted. Then data sniffing needs to happen in the call to 
innerNext(), after the operator has emitted an OK_NEW_SCHEMA message.
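
   To make the ordering concrete, here is a minimal toy sketch of the state 
transitions described above. It is not the actual HashJoinBatch code: the 
class ToyJoin, its fields, the drain helper and the simplified enums are all 
hypothetical names made up for illustration, and only BUILD_SCHEMA, FIRST, 
OK_NEW_SCHEMA, buildSchema and innerNext are taken from the discussion. 

```java
import java.util.ArrayList;
import java.util.List;

public class TwoPhaseSniffSketch {
    // Simplified stand-ins for the operator state and iteration outcome (assumed shapes).
    enum BatchState { BUILD_SCHEMA, FIRST, NOT_FIRST }
    enum IterOutcome { OK_NEW_SCHEMA, NONE }

    // Hypothetical toy operator; models only the state transitions discussed above.
    static class ToyJoin {
        private final boolean prefetchDataInBuildSchema;
        BatchState state = BatchState.BUILD_SCHEMA;
        boolean schemaSniffed = false;
        boolean dataSniffed = false;

        ToyJoin(boolean prefetchDataInBuildSchema) {
            this.prefetchDataInBuildSchema = prefetchDataInBuildSchema;
        }

        // Phase 1: schema sniffing. The old behaviour also prefetched data here.
        void buildSchema() {
            schemaSniffed = true;
            if (prefetchDataInBuildSchema) {
                dataSniffed = true;
                // Side effect of prefetching: the state moves past BUILD_SCHEMA too early.
                state = BatchState.FIRST;
            }
        }

        IterOutcome next() {
            if (state == BatchState.BUILD_SCHEMA) {
                buildSchema();
                if (state == BatchState.BUILD_SCHEMA) {
                    // Schema is known and the state is untouched: announce the schema.
                    state = BatchState.FIRST;
                    return IterOutcome.OK_NEW_SCHEMA;
                }
                // Otherwise the schema announcement is silently skipped.
            }
            return innerNext();
        }

        // Phase 2: data sniffing, deferred until after OK_NEW_SCHEMA was emitted.
        IterOutcome innerNext() {
            if (state == BatchState.FIRST) {
                dataSniffed = true;
                state = BatchState.NOT_FIRST;
            }
            // Both inputs are LIMIT 0 in this toy model, so there are no rows to return.
            return IterOutcome.NONE;
        }
    }

    // Calls next() until the operator reports NONE and records every outcome.
    static List<IterOutcome> drain(ToyJoin join) {
        List<IterOutcome> outcomes = new ArrayList<>();
        IterOutcome outcome;
        do {
            outcome = join.next();
            outcomes.add(outcome);
        } while (outcome != IterOutcome.NONE);
        return outcomes;
    }

    public static void main(String[] args) {
        System.out.println("data prefetched in buildSchema: " + drain(new ToyJoin(true)));
        System.out.println("two-phase sniffing:             " + drain(new ToyJoin(false)));
    }
}
```

   In this toy model the first variant yields only NONE, so a downstream 
consumer never sees a schema, while the two-phase variant yields OK_NEW_SCHEMA 
followed by NONE.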
   
   Other binary operators don't have this issue because they don't live within 
their memory limit, and as a consequence do not need to collect statistics 
about the data through sniffing.
   
   Furthermore, doing the sniffing in two stages is not a hack. It is required 
for functional correctness for queries like the one added in the unit test and 
for the reasons described above.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Hash Join returns incorrect data types when joining subqueries with limit 0
> ---------------------------------------------------------------------------
>
>                 Key: DRILL-6606
>                 URL: https://issues.apache.org/jira/browse/DRILL-6606
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Bohdan Kazydub
>            Assignee: Timothy Farkas
>            Priority: Blocker
>             Fix For: 1.14.0
>
>
> PreparedStatement for query
> {code:sql}
> SELECT l.l_quantity, l.l_shipdate, o.o_custkey
> FROM (SELECT * FROM cp.`tpch/lineitem.parquet` LIMIT 0) l
>     JOIN (SELECT * FROM cp.`tpch/orders.parquet` LIMIT 0) o 
>     ON l.l_orderkey = o.o_orderkey
> LIMIT 0
> {code}
>  is created with wrong types (nullable INTEGER) for all selected columns, no 
> matter what their actual type is. This behavior reproduces with hash join 
> only and is very likely to be caused by DRILL-6027 as the query works fine 
> before this feature was implemented.
> To reproduce the problem you can put the aforementioned query into 
> TestPreparedStatementProvider#joinOrderByQuery() test method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
