[jira] [Commented] (DRILL-5514) Enhance VectorContainer to merge two row sets

ASF GitHub Bot (JIRA) Thu, 15 Jun 2017 12:16:30 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16050956#comment-16050956
 ]


ASF GitHub Bot commented on DRILL-5514:
---------------------------------------

Github user bitblender commented on a diff in the pull request:

    https://github.com/apache/drill/pull/837#discussion_r122287615
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/BatchSchema.java ---
    @@ -162,20 +162,22 @@ private boolean majorTypeEqual(MajorType t1, MajorType t2) {
        * Merge two schema to produce a new, merged schema. The caller is responsible
        * for ensuring that column names are unique. The order of the fields in the
        * new schema is the same as that of this schema, with the other schema's fields
    -   * appended in the order defined in the other schema. The resulting selection
    -   * vector mode is the same as this schema. (That is, this schema is assumed to
    -   * be the main part of the batch, possibly with a selection vector, with the
    -   * other schema representing additional, new columns.)
    +   * appended in the order defined in the other schema.
    +   * <p>
    +   * Merging data with selection vectors is unlikely to be useful, or work well.
    --- End diff --
    
    Can you please leave a comment about why this is unlikely to be useful, or work well?


> Enhance VectorContainer to merge two row sets
> ---------------------------------------------
>
>                 Key: DRILL-5514
>                 URL: https://issues.apache.org/jira/browse/DRILL-5514
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>             Fix For: 1.11.0
>
>
> Consider the concept of a "record batch" in Drill. On the one hand, one can 
> envision a record batch as a stack of records:
> {code}
> | a1 | b1 | c1 |
> ----------------
> | a2 | b2 | c2 |
> {code}
> But, Drill is columnar. So a record batch is really a "bundle" of vectors:
> {code}
> | a1 |    | b1 |    | c1 |
> | a2 |    | b2 |    | c2 |
> {code}
> There are times when it is handy to build up a record batch as a merge of two 
> different vector bundles:
> {code}
> -- bundle 1 --    -- bundle 2 --
> | a1 |    | b1 |        | c1 |
> | a2 |    | b2 |        | c2 |
> {code}
> For example, consider a reader. The reader implementation might read columns 
> (a, b) from a file. The "{{ScanBatch}}" might then add (c) as an 
> implicit vector (the file name, say). The merged set of vectors comprises the 
> final schema: (a, b, c).
> This ticket asks for the code to do the merge:
> * Merge two schemas A = (a, b), B = (c) to create schema C = (a, b, c).
> * Merge two vector containers C1 and C2 to create a new container, C3, that 
> holds the merger of the vectors from the first two.
> Clearly, the merge only makes sense if:
> * The two input containers have the same row count, and
> * The columns in each input container are distinct.
> Because this feature is also useful for tests, add the merge to the "row set" 
> tools as well.
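
The merge described in the ticket can be sketched roughly as follows. This is a minimal illustration only: `MergeSketch`, `Column`, `Bundle`, and `merge` are hypothetical stand-ins invented for this example, not Drill's `BatchSchema` or `VectorContainer` API. It assumes each bundle's columns already have equal lengths, and it enforces the two preconditions from the description (same row count, distinct column names).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for illustration only; these are NOT Drill's
// BatchSchema or VectorContainer classes.
final class MergeSketch {

  /** A named column holding one value per row (a toy "vector"). */
  static final class Column {
    final String name;
    final List<Object> values;

    Column(String name, Object... values) {
      this.name = name;
      this.values = Arrays.asList(values);
    }
  }

  /** A toy "vector bundle": an ordered list of columns. */
  static final class Bundle {
    final List<Column> columns;

    Bundle(Column... columns) {
      this(Arrays.asList(columns));
    }

    Bundle(List<Column> columns) {
      this.columns = columns;
    }

    /** Row count, taken from the first column (assumes equal lengths). */
    int rowCount() {
      return columns.isEmpty() ? 0 : columns.get(0).values.size();
    }

    /** Column names in order, i.e. the toy schema. */
    List<String> schema() {
      List<String> names = new ArrayList<>();
      for (Column c : columns) {
        names.add(c.name);
      }
      return names;
    }
  }

  /**
   * Merges two bundles: the first bundle's columns come first, followed by
   * the second bundle's columns in their own order.
   */
  static Bundle merge(Bundle first, Bundle second) {
    if (first.rowCount() != second.rowCount()) {
      throw new IllegalArgumentException(
          "Row counts differ: " + first.rowCount() + " vs " + second.rowCount());
    }
    Set<String> seen = new HashSet<>();
    List<Column> merged = new ArrayList<>();
    for (Bundle b : Arrays.asList(first, second)) {
      for (Column c : b.columns) {
        if (!seen.add(c.name)) {
          throw new IllegalArgumentException("Duplicate column: " + c.name);
        }
        merged.add(c);
      }
    }
    return new Bundle(merged);
  }

  public static void main(String[] args) {
    // Columns (a, b) read from the file, plus an implicit column (c).
    Bundle fileColumns = new Bundle(
        new Column("a", "a1", "a2"),
        new Column("b", "b1", "b2"));
    Bundle implicitColumns = new Bundle(new Column("c", "c1", "c2"));

    Bundle merged = merge(fileColumns, implicitColumns);
    System.out.println(merged.schema()); // prints [a, b, c]
  }
}
```

The sketch shares column objects between the inputs and the result rather than copying values, which loosely mirrors the idea of combining existing vectors into a new container. Whether the real implementation shares or transfers vector ownership is not stated in the ticket.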



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
