Re: rewriting table for joining logical partitions

Jinfeng Ni Thu, 02 Jun 2016 22:59:59 -0700

sounds to me that what you want to achieve is similar to view
expansion, but there is one big difference.


You want to internally rewrite :
SELECT * FROM EVENTS

==>

SELECT * FROM dfs.`/drill/table1.json` AS t1 JOIN
dfs.`/drill/table2.json`AS t2 ON t1.id = t2.id

This is similar to

CREATE VIEW EVENTS as
SELECT * FROM dfs.`/drill/table1.json` AS t1 JOIN
dfs.`/drill/table2.json`AS t2 ON t1.id = t2.id;

MyTable.toRel() is kind of doing the work of view expansion in [1]

However, for view, the row type of view is cleanly defined, before
view expansion.  For the above "EVENT" view,
it would have t1.* and t2.*. However, in your case, you do not know
the row type of "EVENT" table, until after "EVENT" is expanded. As
such, the top project has only *, in stead of expanding into t1.* and
t2.*.

It's quite challenging to make * column query work against dynamic
table (btw. some of those planner logic is in CALCITE-1150 [2] ). In
your case, it's even more challenging, since the dynamic table would
be expanded into a list of tables.

Why can't you use view / view expansion?  If the list of tables is
kind of dynamic, you may use some script to re-generate the view DDL.
This way, you do not have to solve a bunch of unsolved problems.





[1] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillViewTable.java#L60
[2] https://issues.apache.org/jira/browse/CALCITE-1150

On Thu, Jun 2, 2016 at 12:19 AM, Jesse Yates <jesse.k.ya...@gmail.com> wrote:
> Its definitely getting closer, Thanks Jinfeng!
>
> I end up with this set of plans
> <https://gist.github.com/jyates/f11eb44a44af715b483859f497b9ea89#file-expanded-table-planning>.
> However, the only column that is returned is * and it contains just the
> joined id column, rather than the full column expansion.
>
> This is in contrast to the raw file join physical plan
> <https://gist.github.com/jyates/f11eb44a44af715b483859f497b9ea89#file-raw-sql-physical-plan>
> which has an expanded projection condition.
>
> I'm not sure how I can force the top projection to select correctly. Do i
> inject it as a wrapper scan around the created join or create a new rule
> that takes (project, join) and replaces it with the correct project
> condition? Naturally, I'd have to type that rule/join to not just be the
> logical so it can convert only this generated base case.
>
> Thanks!
> --Jesse
>
> ps. Updated the gist
> <https://gist.github.com/jyates/f11eb44a44af715b483859f497b9ea89#file-drill-to-rel-table-example-java>
> to cover the offset changes for the degenerate 2 table case.
>
> On Wed, Jun 1, 2016 at 10:26 PM Jinfeng Ni <jinfengn...@gmail.com> wrote:
>
>> Seems to me that the reason you run into problem when build the join
>> condition is that you are trying to compose the condition while adding
>> the fields on-the-fly in [1]. This is different from what Drill is
>> doing in Calcite library. For Drill, before construct the join
>> condition (RexNode) t1.id = t2.id, SqlValidator will make sure those
>> fields exists in the table.  For dynamic table, such check will lead
>> to one additional field in table's rowType. After validation, when
>> Calcite builds join condition in SqlToRelConverter, it already knows
>> the field list for each input table (at that time, table's rowType
>> should be immutable), and hence it can adjust the reference to the
>> right table field, by adding the # of LHS fields to the index. That
>> is, =($1, $1) would become =($1, $3).
>>
>> I feel you may follow such logic, by calling field(table1, fieldName)
>> on each side of join, before construct the join condition.
>>
>>
>> [1]
>> https://gist.github.com/jyates/f11eb44a44af715b483859f497b9ea89#file-drill-to-rel-table-example-java-L94-L95
>>
>> On Wed, Jun 1, 2016 at 9:50 PM, Jesse Yates <jesse.k.ya...@gmail.com>
>> wrote:
>> > I'm building a layer "on top"[1] that hides the details of accessing
>> > underlying "partitions" (stored as individual tables) and picks the right
>> > tables based on the query (time partitioned, so prune tables which won't
>> > fulfill the query).
>> >
>> > More concretely, lets take a set of files, each one fulfilling the role
>> of
>> > "table":
>> > /drill
>> >    /table1.json
>> >    /table2.json
>> > or in drill parlance, *dfs.`/drill/table1.json`* and
>> *dfs.`/drill/table2.json`.
>> > T*he query that I want to make externally is:* SELECT * FROM EVENTS *and
>> I
>> > want to translate that internally to
>> >
>> > *SELECT * FROM dfs.`/drill/table1.json` AS t1 JOIN
>> > dfs.`/drill/table2.json`AS t2 ON t1.id <http://t1.id> = t2.id
>> > <http://t2.id>.*Executing that expanded (second) query from the
>> top-level
>> > java.sql.Connection executes fine, hence my thoughts about the sub-table
>> > not actually being found.
>> >
>> > Here is what I am trying to do in my table handler - the 'EVENTS' table (
>> > gist <https://gist.github.com/jyates/f11eb44a44af715b483859f497b9ea89>).
>> >
>> > Does that help at all?
>> >
>> > Thanks,
>> > Jesse
>> >
>> > [1] Where on top in this case includes components inside Drill.
>> >
>> > On Wed, Jun 1, 2016 at 9:27 PM Jinfeng Ni <jinfengn...@gmail.com> wrote:
>> >
>> >> I'm not sure if I understand your problem correctly. Are you trying to
>> >> build some non-SQL interface on top of Drill, to join a set of dynamic
>> >> tables? Can you give more concrete example?
>> >>
>> >> When Drill handles join over two dynamic tables,  except for * column
>> >> query, the dynamic tables have a list of fields defined, since those
>> >> fields are referred in the query (even though the planner does not
>> >> know each field's type). Therefore, the join condition will never be
>> >> =($1,$1); it would be resolved to reference to left/right tables's
>> >> fields.
>> >>
>> >>
>> >>
>> >> On Wed, Jun 1, 2016 at 7:43 PM, Jesse Yates <jya...@apache.org> wrote:
>> >> > Hi all,
>> >> >
>> >> > I'm trying to rewrite a query of a table (ala Table#toRel) to join a
>> set
>> >> of
>> >> > dynamic (sub-)tables on a couple of known columns but am getting stuck
>> >> > building the condition. The sub-tables are not part of the original
>> >> query,
>> >> > but rather selected on-the-fly at logical query time.
>> >> >
>> >> > I can't use UNION-ALL because, outside of a couple of known columns,
>> the
>> >> > remainder are completely dynamic.
>> >> >
>> >> > Using RelBuilder I can construct the join via a series of scans and
>> then
>> >> > joins on the known fields[1]. However, this only creates RelInputRefs
>> >> which
>> >> > are not at all associated with the current ref numbering because the
>> the
>> >> > sub-tables are not present in the original query. Thus, we get
>> conditions
>> >> > like:
>> >> > (=($1,$1), which looks appears TRUE but actually should reference the
>> >> > left/right tables' fields.
>> >> >
>> >> > I tried playing around with RelRangeRef and manually managing the
>> field
>> >> > offsets in query (similar to BlackBoard), but that call gets
>> translated
>> >> > into an actually ALWAYS-TRUE condition and also fails the
>> >> > JoinUtil#checkCartesianJoin case.
>> >> >
>> >> > If I construct the query via standard SQL at the top level (using
>> known
>> >> > tables), everything works fine, I think because of the ref-numbering
>> to
>> >> > which I cannot get access in #toRel().
>> >> >
>> >> > Any thoughts on the right way to go about this?
>> >> >
>> >> > Thanks much,
>> >> > Jesse Yates
>> >> >
>> >> >
>> >> > [1] Actually, this meant digging into RelBuilder
>> >> > <
>> >>
>> https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/tools/RelBuilder.java#L309
>> >> >
>> >> > since dynamic tables require the field names to already be set and
>> then
>> >> > uses the column name from the list to get the field index
>> >>
>>

Re: rewriting table for joining logical partitions

Reply via email to