Paul Rogers created DRILL-5384:
----------------------------------

             Summary: Sort cannot directly access map members, causes a data copy
                 Key: DRILL-5384
                 URL: https://issues.apache.org/jira/browse/DRILL-5384
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.10.0
            Reporter: Paul Rogers
            Priority: Minor


Suppose we have a JSON structure for "orders" like this:

{code}
{ "customer": { "id": 10, "name": "fred" },
  "order": { "id": 20, "product": "Frammis 1000" } }
{code}

Suppose I want to sort by customer.id. Today, Drill projects customer.id up to 
the top level as a temporary, hidden field, copying the data from the 
customer.id vector into this new field. Drill then sorts on the temporary 
column and uses a second project to remove it.
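The flow above can be sketched roughly as follows. This is a toy model only: 
plain Python lists stand in for value vectors, a dict stands in for a map, and 
the temporary column name, the helper names, and the extra sample rows are all 
made up for illustration. It is not Drill's actual operator code.

```python
# Toy columnar batch: each list stands in for a value vector, and a map
# column ("customer", "order") is a dict of child vectors. Sample rows beyond
# the first are invented for illustration.
batch = {
    "customer": {"id": [10, 7, 12], "name": ["fred", "wilma", "barney"]},
    "order": {"id": [20, 21, 22], "product": ["Frammis 1000", "Widget", "Gizmo"]},
}

TEMP_KEY = "$sort_key$"  # hypothetical name for the hidden temporary column


def project_key_up(batch, map_name, member):
    """First project: copy the nested vector into a new top-level column."""
    out = dict(batch)
    out[TEMP_KEY] = list(batch[map_name][member])  # the extra data copy
    return out


def sort_on_top_level(batch, key):
    """Sort: order rows by a top-level column, then reorder every vector."""
    order = sorted(range(len(batch[key])), key=lambda i: batch[key][i])

    def reorder(col):
        if isinstance(col, dict):  # map column: reorder each child vector
            return {name: reorder(child) for name, child in col.items()}
        return [col[i] for i in order]

    return {name: reorder(col) for name, col in batch.items()}


def project_key_away(batch):
    """Second project: drop the temporary column."""
    return {name: col for name, col in batch.items() if name != TEMP_KEY}


projected = project_key_up(batch, "customer", "id")
sorted_batch = project_key_away(sort_on_top_level(projected, TEMP_KEY))
```

Note that, in this sketch, the sort step has to carry both customer.id and its 
copy until the second project drops the temporary column.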

Clearly, this works, but it has a cost:

* Two extra project operators.
* An extra memory copy.
* Sort must buffer both the original and the copied data. This can double 
memory use in the worst case.

All of this is done simply to avoid having to reference "customer.id" in the 
sort.

But, as explained in DRILL-5376, maps are just nested tuples. There is no need 
to copy the data: it is already right there in a value vector. The problem is 
that Drill's map implementation makes it hard for the generated code to get at 
the "customer.id" vector.

This ticket asks to allow the sort to work directly with nested scalars to 
avoid the overhead explained above. To do this:

1. Fix nested scalar access to allow the generated code to easily access a 
nested scalar.
2. Allow a sort key of the form "customer.id".
3. Modify the planner to generate such sort keys instead of the dual projects.

The result will be a leaner, faster sort operation when sorting on scalars 
within a map.
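
As a rough illustration of the idea, a sort that accepts a key of the form 
"customer.id" might look like the sketch below. It uses a toy model in which 
plain Python lists stand in for value vectors and dicts stand in for maps; the 
helper names and the extra sample rows are hypothetical, and this is not a 
proposal for the actual generated code.

```python
# Toy columnar batch: each list stands in for a value vector, and a map
# column is a dict of child vectors. Sample rows beyond the first are
# invented for illustration.
batch = {
    "customer": {"id": [10, 7, 12], "name": ["fred", "wilma", "barney"]},
    "order": {"id": [20, 21, 22], "product": ["Frammis 1000", "Widget", "Gizmo"]},
}


def resolve_vector(batch, path):
    """Walk a dotted path such as "customer.id" down to the nested vector.

    Returns the existing vector itself; nothing is copied.
    """
    node = batch
    for part in path.split("."):
        node = node[part]
    return node


def sort_on_path(batch, path):
    """Sort the batch using a nested scalar as the key, with no temporary column."""
    key_vector = resolve_vector(batch, path)
    order = sorted(range(len(key_vector)), key=key_vector.__getitem__)

    def reorder(col):
        if isinstance(col, dict):  # map column: reorder each child vector
            return {name: reorder(child) for name, child in col.items()}
        return [col[i] for i in order]

    return {name: reorder(col) for name, col in batch.items()}


sorted_batch = sort_on_path(batch, "customer.id")
```

In this model the key lookup returns the nested vector in place, so neither a 
hidden top-level field nor a second project is needed.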
  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
