Steve Carlin created IMPALA-12961:
-------------------------------------

             Summary: Use a Map instead of an ArrayList for Expr in HDFS RelNode
                 Key: IMPALA-12961
                 URL: https://issues.apache.org/jira/browse/IMPALA-12961
             Project: IMPALA
          Issue Type: Sub-task
            Reporter: Steve Carlin


This came up in code review in ImpalaHdfsScanRel:

"For wide tables where we are only needing a few columns projected, we will end 
up with a long list with mostly Nulls. A LinkedHashMap (preserves Insertion 
order) where the key is position and value is the SlotRef would be better 
suited despite the cpu cost of hashing. In general, in a query planner, memory 
is the most precious commodity since the plan search space can be large, so 
anything we can do to reduce memory footprint would be preferred."

One counter argument: the list is used in other RelNodes, and it seems more 
natural. For instance, a Project RelNode refers to its input columns through 
RexInputRef RexNodes such as "$2", which are positional indexes, so an array 
is the natural backing structure. Every other RelNode works this way except 
the ScanNode.

To add to the counter argument: take a worst-case scenario of a query with 
10 tables of 500 columns apiece. If we allocate 8-byte references, we would 
need 10 * 500 * 8 = 40,000 bytes to hold this information. While reducing 
the memory footprint is important, reducing it by 40,000 bytes isn't going 
to make a noticeable impact. Even taking into account that multiple queries 
run simultaneously, this is a very short-lived code path. So should we go 
with the more natural approach or the less memory-intensive one?
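To make the tradeoff concrete, here is a small standalone sketch (not Impala code; the SlotRef stand-in and method names are hypothetical) comparing the two representations for a scan that projects 3 columns out of a 500-column table, along with the 40,000-byte worst-case arithmetic from above:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SlotRefStorageSketch {
    // Stand-in for Impala's SlotRef; only the column position matters here.
    static class SlotRef {
        final int position;
        SlotRef(int position) { this.position = position; }
    }

    // ArrayList representation: one entry per table column, mostly null
    // for wide tables with few projected columns.
    static List<SlotRef> asList(int numCols, int[] projected) {
        List<SlotRef> slots =
            new ArrayList<>(Collections.nCopies(numCols, (SlotRef) null));
        for (int pos : projected) {
            slots.set(pos, new SlotRef(pos));
        }
        return slots;
    }

    // LinkedHashMap representation: only projected columns are stored,
    // keyed by position, with insertion order preserved.
    static Map<Integer, SlotRef> asMap(int[] projected) {
        Map<Integer, SlotRef> slots = new LinkedHashMap<>();
        for (int pos : projected) {
            slots.put(pos, new SlotRef(pos));
        }
        return slots;
    }

    public static void main(String[] args) {
        int numCols = 500;
        int[] projected = {2, 17, 499};

        List<SlotRef> list = asList(numCols, projected);
        Map<Integer, SlotRef> map = asMap(projected);

        // The list carries 500 entries (497 of them null); the map carries 3.
        System.out.println("list entries: " + list.size()); // 500
        System.out.println("map entries: " + map.size());   // 3

        // Lookup by position: a direct index for the list, a hash probe
        // for the map; both return the same SlotRef.
        System.out.println(list.get(17).position == map.get(17).position);

        // The reference-overhead arithmetic from the discussion above:
        // 10 tables * 500 columns * 8-byte references = 40,000 bytes.
        long bytes = 10L * 500 * 8;
        System.out.println("worst-case reference bytes: " + bytes); // 40000
    }
}
```

The sketch only illustrates the shape of the tradeoff: the list pays a fixed per-column reference cost regardless of how many columns are projected, while the map pays per projected column plus hashing overhead.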



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
