tomsimpkins commented on issue #36665:
URL: https://github.com/apache/arrow/issues/36665#issuecomment-1634532364

   A more representative example would look like this:
   ```
   dataset = (
       dataset
       .project({"food": ds.field("food"), "costPlusOne": ds.field("cost") + 1})
       .project({
           "food": ds.field("food"),
           "costPlusTwo": ds.field("costPlusOne") + 1,
           "costPlusThree": ds.field("costPlusOne") + 2,
           "costPlusFour": ds.field("costPlusOne") + 3,
       })
   )
   ```
   
   It seems convenient to be able to reference the intermediate column
   (`costPlusOne`) when building up the expressions. On the other hand, we
   could compose the expressions outside of the dataset and do a single
   projection (i.e. via `columns` in `.scanner` and `.to_table`):
   ```
   dataset.to_table(columns={
       "food": ds.field("food"),
       "costPlusTwo": ds.field("cost") + 1 + 1,
       "costPlusThree": ds.field("cost") + 1 + 2,
       "costPlusFour": ds.field("cost") + 1 + 3,
   })
   ```
   
   I don't know the details of the execution / query planning side, but I
   would have expected the repeated work to be easier to optimise out in the
   former.

