Re: rewriting table for joining logical partitions

Jesse Yates Thu, 02 Jun 2016 23:13:11 -0700

Ended up playing around with it a bit more and came to realize that what I
really need is a UNION. The Join ended up working based on your last
suggestion (buggy impl on my side originally).


View expansion sounds like it would work, but the regeneration is less
appealing as I'd like to have low latency access to new tables as they are
available.

To make UNION work I'm working on injecting an "mapping" operator over the
table that takes the unknown fields and maps them into {VARCHAR -> ANY}
object (similar to how the HBase plugin does dynamic fields). Then Union is
just over, in this case, two fields: id, _map which I _think_ should work
alright.

My only concern is then supporting the new operation fully in districtuon
as the Physical/LogicalOperator support for non-internal types  -
specifically the #visit() functionality - seems to be a bit lacking...
unless I'm missing something?

Right now down to the point of having to implement the RecordBatch stuff.
I'll update when I make some notable headway.

Thanks Jinfeng!

On Thu, Jun 2, 2016 at 10:59 PM Jinfeng Ni <jinfengn...@gmail.com> wrote:

> sounds to me that what you want to achieve is similar to view
> expansion, but there is one big difference.
>
> You want to internally rewrite :
> SELECT * FROM EVENTS
>
> ==>
>
> SELECT * FROM dfs.`/drill/table1.json` AS t1 JOIN
> dfs.`/drill/table2.json`AS t2 ON t1.id = t2.id
>
> This is similar to
>
> CREATE VIEW EVENTS as
> SELECT * FROM dfs.`/drill/table1.json` AS t1 JOIN
> dfs.`/drill/table2.json`AS t2 ON t1.id = t2.id;
>
> MyTable.toRel() is kind of doing the work of view expansion in [1]
>
> However, for view, the row type of view is cleanly defined, before
> view expansion.  For the above "EVENT" view,
> it would have t1.* and t2.*. However, in your case, you do not know
> the row type of "EVENT" table, until after "EVENT" is expanded. As
> such, the top project has only *, in stead of expanding into t1.* and
> t2.*.
>
> It's quite challenging to make * column query work against dynamic
> table (btw. some of those planner logic is in CALCITE-1150 [2] ). In
> your case, it's even more challenging, since the dynamic table would
> be expanded into a list of tables.
>
> Why can't you use view / view expansion?  If the list of tables is
> kind of dynamic, you may use some script to re-generate the view DDL.
> This way, you do not have to solve a bunch of unsolved problems.
>
>
>
>
>
> [1]
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillViewTable.java#L60
> [2] https://issues.apache.org/jira/browse/CALCITE-1150
>
> On Thu, Jun 2, 2016 at 12:19 AM, Jesse Yates <jesse.k.ya...@gmail.com>
> wrote:
> > Its definitely getting closer, Thanks Jinfeng!
> >
> > I end up with this set of plans
> > <
> https://gist.github.com/jyates/f11eb44a44af715b483859f497b9ea89#file-expanded-table-planning
> >.
> > However, the only column that is returned is * and it contains just the
> > joined id column, rather than the full column expansion.
> >
> > This is in contrast to the raw file join physical plan
> > <
> https://gist.github.com/jyates/f11eb44a44af715b483859f497b9ea89#file-raw-sql-physical-plan
> >
> > which has an expanded projection condition.
> >
> > I'm not sure how I can force the top projection to select correctly. Do i
> > inject it as a wrapper scan around the created join or create a new rule
> > that takes (project, join) and replaces it with the correct project
> > condition? Naturally, I'd have to type that rule/join to not just be the
> > logical so it can convert only this generated base case.
> >
> > Thanks!
> > --Jesse
> >
> > ps. Updated the gist
> > <
> https://gist.github.com/jyates/f11eb44a44af715b483859f497b9ea89#file-drill-to-rel-table-example-java
> >
> > to cover the offset changes for the degenerate 2 table case.
> >
> > On Wed, Jun 1, 2016 at 10:26 PM Jinfeng Ni <jinfengn...@gmail.com>
> wrote:
> >
> >> Seems to me that the reason you run into problem when build the join
> >> condition is that you are trying to compose the condition while adding
> >> the fields on-the-fly in [1]. This is different from what Drill is
> >> doing in Calcite library. For Drill, before construct the join
> >> condition (RexNode) t1.id = t2.id, SqlValidator will make sure those
> >> fields exists in the table.  For dynamic table, such check will lead
> >> to one additional field in table's rowType. After validation, when
> >> Calcite builds join condition in SqlToRelConverter, it already knows
> >> the field list for each input table (at that time, table's rowType
> >> should be immutable), and hence it can adjust the reference to the
> >> right table field, by adding the # of LHS fields to the index. That
> >> is, =($1, $1) would become =($1, $3).
> >>
> >> I feel you may follow such logic, by calling field(table1, fieldName)
> >> on each side of join, before construct the join condition.
> >>
> >>
> >> [1]
> >>
> https://gist.github.com/jyates/f11eb44a44af715b483859f497b9ea89#file-drill-to-rel-table-example-java-L94-L95
> >>
> >> On Wed, Jun 1, 2016 at 9:50 PM, Jesse Yates <jesse.k.ya...@gmail.com>
> >> wrote:
> >> > I'm building a layer "on top"[1] that hides the details of accessing
> >> > underlying "partitions" (stored as individual tables) and picks the
> right
> >> > tables based on the query (time partitioned, so prune tables which
> won't
> >> > fulfill the query).
> >> >
> >> > More concretely, lets take a set of files, each one fulfilling the
> role
> >> of
> >> > "table":
> >> > /drill
> >> >    /table1.json
> >> >    /table2.json
> >> > or in drill parlance, *dfs.`/drill/table1.json`* and
> >> *dfs.`/drill/table2.json`.
> >> > T*he query that I want to make externally is:* SELECT * FROM EVENTS
> *and
> >> I
> >> > want to translate that internally to
> >> >
> >> > *SELECT * FROM dfs.`/drill/table1.json` AS t1 JOIN
> >> > dfs.`/drill/table2.json`AS t2 ON t1.id <http://t1.id> = t2.id
> >> > <http://t2.id>.*Executing that expanded (second) query from the
> >> top-level
> >> > java.sql.Connection executes fine, hence my thoughts about the
> sub-table
> >> > not actually being found.
> >> >
> >> > Here is what I am trying to do in my table handler - the 'EVENTS'
> table (
> >> > gist <https://gist.github.com/jyates/f11eb44a44af715b483859f497b9ea89
> >).
> >> >
> >> > Does that help at all?
> >> >
> >> > Thanks,
> >> > Jesse
> >> >
> >> > [1] Where on top in this case includes components inside Drill.
> >> >
> >> > On Wed, Jun 1, 2016 at 9:27 PM Jinfeng Ni <jinfengn...@gmail.com>
> wrote:
> >> >
> >> >> I'm not sure if I understand your problem correctly. Are you trying
> to
> >> >> build some non-SQL interface on top of Drill, to join a set of
> dynamic
> >> >> tables? Can you give more concrete example?
> >> >>
> >> >> When Drill handles join over two dynamic tables,  except for * column
> >> >> query, the dynamic tables have a list of fields defined, since those
> >> >> fields are referred in the query (even though the planner does not
> >> >> know each field's type). Therefore, the join condition will never be
> >> >> =($1,$1); it would be resolved to reference to left/right tables's
> >> >> fields.
> >> >>
> >> >>
> >> >>
> >> >> On Wed, Jun 1, 2016 at 7:43 PM, Jesse Yates <jya...@apache.org>
> wrote:
> >> >> > Hi all,
> >> >> >
> >> >> > I'm trying to rewrite a query of a table (ala Table#toRel) to join
> a
> >> set
> >> >> of
> >> >> > dynamic (sub-)tables on a couple of known columns but am getting
> stuck
> >> >> > building the condition. The sub-tables are not part of the original
> >> >> query,
> >> >> > but rather selected on-the-fly at logical query time.
> >> >> >
> >> >> > I can't use UNION-ALL because, outside of a couple of known
> columns,
> >> the
> >> >> > remainder are completely dynamic.
> >> >> >
> >> >> > Using RelBuilder I can construct the join via a series of scans and
> >> then
> >> >> > joins on the known fields[1]. However, this only creates
> RelInputRefs
> >> >> which
> >> >> > are not at all associated with the current ref numbering because
> the
> >> the
> >> >> > sub-tables are not present in the original query. Thus, we get
> >> conditions
> >> >> > like:
> >> >> > (=($1,$1), which looks appears TRUE but actually should reference
> the
> >> >> > left/right tables' fields.
> >> >> >
> >> >> > I tried playing around with RelRangeRef and manually managing the
> >> field
> >> >> > offsets in query (similar to BlackBoard), but that call gets
> >> translated
> >> >> > into an actually ALWAYS-TRUE condition and also fails the
> >> >> > JoinUtil#checkCartesianJoin case.
> >> >> >
> >> >> > If I construct the query via standard SQL at the top level (using
> >> known
> >> >> > tables), everything works fine, I think because of the
> ref-numbering
> >> to
> >> >> > which I cannot get access in #toRel().
> >> >> >
> >> >> > Any thoughts on the right way to go about this?
> >> >> >
> >> >> > Thanks much,
> >> >> > Jesse Yates
> >> >> >
> >> >> >
> >> >> > [1] Actually, this meant digging into RelBuilder
> >> >> > <
> >> >>
> >>
> https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/tools/RelBuilder.java#L309
> >> >> >
> >> >> > since dynamic tables require the field names to already be set and
> >> then
> >> >> > uses the column name from the list to get the field index
> >> >>
> >>
>

Re: rewriting table for joining logical partitions

Reply via email to