alamb commented on issue #1221: URL: https://github.com/apache/arrow-datafusion/issues/1221#issuecomment-968850180
> Can you please explain the "data locality" requirements a little more? I
> think for normal source tasks which read data from remote storage (cloud storage
> or HDFS), there is no data locality. And for shuffle readers which have to read
> data from all map tasks, there is no data locality either.
I was thinking of a plan such as the following. There may be cases where
reshuffling between scan/filter and aggregate is worthwhile (e.g. to better
distribute the load), but I think the cost of reshuffling will mostly end up
dominating any savings.
```
rest of plan
│
│
│
┌ ─ ─ ─ ─ ─ ─ ┼ ─ ─ ─ ─ ─ ─ ─ ┐
▼
│ ┌───────────────────┐ │
│ HashAggregate │
│ └───────────────────┘ │
│ Data is not reshuffled
│ │ │ between scan, filter and
▼ ◀ ─ ─ ─ ─ ─ aggregate
│ ┌───────────────────┐ │
│ Filter │
│ └───────────────────┘ │
│
│ │ │
▼
│ ┌───────────────────┐ │
│ TableScan │
│ └───────────────────┘ │
└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
```
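The idea in the diagram can be sketched in plain Python (this is not DataFusion's actual API; the partition data and helper names are hypothetical): scan, filter, and partial aggregation run back-to-back inside each task, so rows never cross the network between those operators. Only the small per-partition partial aggregates are exchanged and merged at the end.

```python
# Sketch (plain Python, not DataFusion) of pipelining scan/filter/aggregate
# within a partition, avoiding a reshuffle between those operators.
from collections import Counter

# Hypothetical input: two partitions of (group_key, value) rows.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("a", 5), ("b", 6)],
]

def scan_filter_partial_agg(partition):
    """One pipelined stage: scan rows, filter them, partially aggregate.

    All three operators run in the same task, so no data is reshuffled
    between scan, filter, and aggregate.
    """
    acc = Counter()
    for key, value in partition:
        if value > 1:          # Filter
            acc[key] += value  # HashAggregate (partial, per partition)
    return acc

# Only the partial aggregates (one small map per partition) need to be
# exchanged; the final merge is cheap compared to reshuffling raw rows.
partials = [scan_filter_partial_agg(p) for p in partitions]

final = Counter()
for part in partials:
    final.update(part)

print(dict(final))
```

Reshuffling between the scan/filter and the aggregate would instead move every surviving row over the network, which is the cost the comment argues usually dominates any load-balancing benefit.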
