GitHub user bobbai00 added a comment to the discussion: Refactor: Decoupling
Direct Database Connection From ComputingUnitMaster & ComputingUnitWorker
Here is the layout of the physical plan:
# Physical Plan Spec
Layout:
```json
{
  "operators": [...list of physical operators...],
  "links": [...list of physical links...]
}
```
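
As a minimal sketch of how a client might consume this layout, the snippet below parses a plan and checks only the two top-level keys named above. The function name `load_physical_plan` and the empty example plan are hypothetical and not part of Texera.

```python
import json


def load_physical_plan(text: str) -> dict:
    """Parse a physical plan and check the two top-level lists described above."""
    plan = json.loads(text)
    if not isinstance(plan, dict):
        raise ValueError("physical plan must be a JSON object")
    for key in ("operators", "links"):
        if not isinstance(plan.get(key), list):
            raise ValueError(f"physical plan is missing a list under {key!r}")
    return plan


# Hypothetical empty plan, used only to exercise the check.
plan = load_physical_plan('{"operators": [], "links": []}')
print(len(plan["operators"]), len(plan["links"]))
```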
## Physical Operator Spec
```json
{
  "id": {
    "logicalOpId": { "id": "CSVScanSource-operator-id" },
    "layerName": "main" // distinguishes physical stages from the same logical op: main | partial | final
  },
  "workflowId": { "id": 0 },
  "executionId": { "id": 1 },
  "opExecInitInfo": { // tells Amber how to construct the runtime executor
    // JVM operators use kind "className":
    "kind": "className",
    "className": "org.apache.texera.amber.operator.source.scan.csv.CSVScanSourceOpExec",
    "descString": "{...a JSON STRING that describes the properties of the physical operator...}"
    // For scan sources (CSV/JSONL/Arrow/file), the source path lives here as `fileName`.
    // It looks like this: `dataset:///dataset-15/versionHash/raw/data.csv`
    // (if the file is resolved on the local file system, it starts with `file:///...`)
    // For UDF operators, use kind "code" instead:
    // { "kind": "code", "code": "class ProcessTupleOperator(...): ...", "language": "python" }
  },
  "parallelizable": true,
  "locationPreference": { "type": "roundRobin" },
  "partitionRequirement": [], // what each INPUT expects (array: one entry per input port)
  //   null                                               -> no requirement for that input
  //   { "type": "single" }                               -> gather into one partition
  //   { "type": "hash", "hashAttributeNames": ["id"] }   -> hash-partitioned by attributes
  //   { "type": "broadcast" }                            -> broadcast to workers
  //   { "type": "oneToOne" }                             -> partitioning maps one-to-one
  //   { "type": "none" }                                 -> no partitioning
  "partitionDeriveSpec": { "type": "passthrough" }, // what partitioning this operator PRODUCES
  //   passthrough                -> preserve upstream partitioning
  //   toSingle                   -> produce a single partition
  //   toHash + hashAttributeNames -> produce hash partitioning
  //   toUnknown                  -> partitioning unknown
  //   projection                 -> derive through projection
  "inputPortsSerialized": {},  // map keyed "<portId>_<internalFlag>", e.g. "0_false"
  "outputPortsSerialized": {}, // value = 2-item array: [portMetadata, schema|null]
  //   portMetadata: { id:{id,internal}, displayName, blocking, mode }
  //   output `mode`: 0 = set snapshot | 1 = set delta | 2 = single snapshot
  //   schema: { attributes: [ { attributeName, attributeType }, ... ] } or null
  //   attributeType: string | integer (32-bit) | long (64-bit) | double |
  //                  boolean | timestamp | binary | large_binary (pointer-like)
  "isOneToManyOp": false,
  "suggestedWorkerNum": 1,
  "pveName": ""
}
```
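
To make two of the encodings above concrete, here is a rough sketch of reading a single operator entry: splitting a port key such as `"0_false"`, and pulling `fileName` out of `descString`. It rests on two assumptions that the spec above does not confirm: the internal flag is serialized as lowercase `true`/`false`, and `fileName` is a top-level key of the JSON held in `descString`. The helper names `parse_port_key` and `scan_source_path` are hypothetical.

```python
import json
from typing import Optional, Tuple


def parse_port_key(key: str) -> Tuple[int, bool]:
    """Split a key such as "0_false" into (port id, internal flag)."""
    port_id, _, internal = key.partition("_")
    # Assumption: the flag is the lowercase text "true" or "false".
    if internal not in ("true", "false"):
        raise ValueError(f"unexpected port key: {key!r}")
    return int(port_id), internal == "true"


def scan_source_path(op: dict) -> Optional[str]:
    """Return `fileName` from descString for className operators, else None."""
    info = op["opExecInitInfo"]
    if info.get("kind") != "className":
        return None  # e.g. kind "code" for UDF operators
    desc = json.loads(info["descString"])  # descString is itself a JSON string
    # Assumption: fileName sits at the top level of the decoded object.
    return desc.get("fileName")


# Hypothetical operator entry, trimmed to the fields used here.
op = {
    "opExecInitInfo": {
        "kind": "className",
        "className": "org.apache.texera.amber.operator.source.scan.csv.CSVScanSourceOpExec",
        "descString": json.dumps(
            {"fileName": "dataset:///dataset-15/versionHash/raw/data.csv"}
        ),
    },
    "outputPortsSerialized": {"0_false": [{"id": {"id": 0, "internal": False}}, None]},
}

print(scan_source_path(op))
print([parse_port_key(k) for k in op["outputPortsSerialized"]])
```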
## Physical Link Spec
Each item in `links` connects one physical output port to one physical input
port.
```json
{
  "fromOpId": {
    "logicalOpId": { "id": "source-op-id" },
    "layerName": "main"
  },
  "fromPortId": { "id": 0, "internal": false },
  "toOpId": {
    "logicalOpId": { "id": "target-op-id" },
    "layerName": "main"
  },
  "toPortId": { "id": 0, "internal": false }
}
```
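
Since each link pairs one output port with one input port, a client could index the `links` list roughly as follows. This is only a sketch: identifying a physical operator by `(logicalOpId, layerName)` follows the `id` fields shown above, while the helper names `op_key`, `port_key`, and `downstream_map` are hypothetical.

```python
from collections import defaultdict


def op_key(op_id: dict) -> tuple:
    """Identify a physical operator by (logicalOpId, layerName)."""
    return (op_id["logicalOpId"]["id"], op_id["layerName"])


def port_key(port_id: dict) -> tuple:
    """Identify a port by (id, internal)."""
    return (port_id["id"], port_id["internal"])


def downstream_map(links: list) -> dict:
    """Map each (operator, output port) to the (operator, input port) pairs it feeds."""
    out = defaultdict(list)
    for link in links:
        src = (op_key(link["fromOpId"]), port_key(link["fromPortId"]))
        dst = (op_key(link["toOpId"]), port_key(link["toPortId"]))
        out[src].append(dst)
    return dict(out)


# The single link from the spec above.
links = [
    {
        "fromOpId": {"logicalOpId": {"id": "source-op-id"}, "layerName": "main"},
        "fromPortId": {"id": 0, "internal": False},
        "toOpId": {"logicalOpId": {"id": "target-op-id"}, "layerName": "main"},
        "toPortId": {"id": 0, "internal": False},
    }
]
print(downstream_map(links))
```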
@Yicong-Huang @aglinxinyuan @Xiao-zhen-Liu Is this interpretation accurate? If
so, I don't think the physical plan contains any sensitive information, and we
can safely expose it to the client.
GitHub link:
https://github.com/apache/texera/discussions/5295#discussioncomment-17204773