[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=640819&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640819 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 23/Aug/21 19:11 Start Date: 23/Aug/21 19:11 Worklog Time Spent: 10m Work Description: lostluck merged pull request #15358: URL: https://github.com/apache/beam/pull/15358 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640819) Time Spent: 5.5h (was: 5h 20m) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Assignee: Robert Burke >Priority: P3 > Fix For: 2.33.0 > > Time Spent: 5.5h > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=640297&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640297 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 20/Aug/21 16:16 Start Date: 20/Aug/21 16:16 Worklog Time Spent: 10m Work Description: lostluck commented on pull request #15358: URL: https://github.com/apache/beam/pull/15358#issuecomment-902805140 Run Go PostCommit -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640297) Time Spent: 5h 20m (was: 5h 10m) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Assignee: Robert Burke >Priority: P3 > Fix For: 2.33.0 > > Time Spent: 5h 20m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=640129&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640129 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 20/Aug/21 06:01 Start Date: 20/Aug/21 06:01 Worklog Time Spent: 10m Work Description: lostluck commented on pull request #15358: URL: https://github.com/apache/beam/pull/15358#issuecomment-902452550 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 640129) Time Spent: 5h 10m (was: 5h) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Assignee: Robert Burke >Priority: P3 > Fix For: 2.33.0 > > Time Spent: 5h 10m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=640128&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-640128 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 20/Aug/21 06:00 Start Date: 20/Aug/21 06:00 Worklog Time Spent: 10m Work Description: lostluck opened a new pull request #15358: URL: https://github.com/apache/beam/pull/15358 * Avoid the interstitial PCollection between DoFns and the DataSink, and collect full data from the DataSink. * Don't collect PCollection data from the synthetic PCollection between MergeAccumulators and ExtractOutputs as there are no user analogs for that collection. Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] Update `CHANGES.md` with noteworthy changes. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier). `ValidatesRunner` compliance status (on master branch) Lang ULR Dataflow Flink Samza Spark Twister2 Go --- https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/";> https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon";> https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/";> https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon";> https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Samza/lastCompletedBuild/";> https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Samza/lastCompletedBuild/badge/icon";> https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/";> https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon";> --- Java https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_ULR/lastCompletedBuild/";> https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_ULR/lastCompletedBuild/badge/icon";> https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/";> https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon?subject=V1";> https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Streaming/lastCompletedBuild/";> https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Streaming/lastCompletedBuild/badge/icon?subject=V1+Streaming";> https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/";> https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/badge/icon?subject=V1+Java+11";> https://ci-beam.apache.org/job/beam_PostCommit_Java_VR_Dataflow_V2/lastCompletedBuild/";> https://ci-beam.apache.org/job/beam_PostCommit_Java_VR_Dataflow_V2/lastCompletedBuild/badge/icon?subject=V2";> https://ci-beam.apache.org/job/beam_PostCommit_Java_VR_Dataflow_V2_Streaming/lastCompletedBuild/";> https://ci-beam.apache.org/job/beam_PostCommit_Java_VR_Dataflow_V2_Streaming/lastCompletedBuild/badge/icon?subject=V2+Streaming";> https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/";> https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon?subject=Java+8";> https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/";> https://ci-beam.apache.org/job/bea
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=637409&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-637409 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 12/Aug/21 16:29 Start Date: 12/Aug/21 16:29 Worklog Time Spent: 10m Work Description: lostluck merged pull request #15289: URL: https://github.com/apache/beam/pull/15289 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 637409) Time Spent: 4h 50m (was: 4h 40m) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Assignee: Robert Burke >Priority: P3 > Time Spent: 4h 50m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=637366&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-637366 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 12/Aug/21 14:36 Start Date: 12/Aug/21 14:36 Worklog Time Spent: 10m Work Description: lostluck commented on pull request #15289: URL: https://github.com/apache/beam/pull/15289#issuecomment-897693463 Run GoPortable PreCommit -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 637366) Time Spent: 4h 40m (was: 4.5h) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Assignee: Robert Burke >Priority: P3 > Time Spent: 4h 40m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=637186&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-637186 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 11/Aug/21 23:53 Start Date: 11/Aug/21 23:53 Worklog Time Spent: 10m Work Description: lostluck commented on pull request #15289: URL: https://github.com/apache/beam/pull/15289#issuecomment-897236662 Run Go Postcommit -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 637186) Time Spent: 4.5h (was: 4h 20m) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Assignee: Robert Burke >Priority: P3 > Time Spent: 4.5h > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=637011&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-637011 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 11/Aug/21 17:42 Start Date: 11/Aug/21 17:42 Worklog Time Spent: 10m Work Description: lostluck commented on a change in pull request #15289: URL: https://github.com/apache/beam/pull/15289#discussion_r687043611 ## File path: sdks/go/pkg/beam/core/runtime/exec/pcollection_test.go ## @@ -0,0 +1,154 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +//http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package exec + +import ( + "context" + "math" + "testing" + + "github.com/apache/beam/sdks/go/pkg/beam/core/graph/coder" + "github.com/apache/beam/sdks/go/pkg/beam/core/graph/mtime" + "github.com/apache/beam/sdks/go/pkg/beam/core/graph/window" +) + +// TestPCollection verifies that the PCollection node works correctly. +// Seed is by default set to 0, so we have a "deterministic" set of +// randomness for the samples. +func TestPCollection(t *testing.T) { + a := &CaptureNode{UID: 1} + pcol := &PCollection{UID: 2, Out: a, Coder: coder.NewVarInt()} + // The "large" 2nd value is to ensure the values are encoded properly, + // and that Min & Max are behaving. + inputs := []interface{}{int64(1), int64(20), int64(3)} + in := &FixedRoot{UID: 3, Elements: makeInput(inputs...), Out: pcol} + + p, err := NewPlan("a", []Unit{a, pcol, in}) + if err != nil { + t.Fatalf("failed to construct plan: %v", err) + } + + if err := p.Execute(context.Background(), "1", DataContext{}); err != nil { + t.Fatalf("execute failed: %v", err) + } + if err := p.Down(context.Background()); err != nil { + t.Fatalf("down failed: %v", err) + } + + expected := makeValues(inputs...) + if !equalList(a.Elements, expected) { + t.Errorf("multiplex returned %v for a, want %v", extractValues(a.Elements...), extractValues(expected...)) + } + snap := pcol.snapshot() + if want, got := int64(len(expected)), snap.ElementCount; got != want { + t.Errorf("snapshot miscounted: got %v, want %v", got, want) + } + checkPCollectionSizeSample(t, snap, 3, 7, 1, 5) +} + +func TestPCollection_sizeReset(t *testing.T) { + // Check the initial values after resetting. + var pcol PCollection + pcol.resetSize() + snap := pcol.snapshot() + checkPCollectionSizeSample(t, snap, 0, 0, math.MaxInt64, math.MinInt64) +} + +func checkPCollectionSizeSample(t *testing.T, snap PCollectionSnapshot, count, sum, min, max int64) { + t.Helper() + if want, got := int64(count), snap.SizeCount; got != want { + t.Errorf("sample count incorrect: got %v, want %v", got, want) + } + if want, got := int64(sum), snap.SizeSum; got != want { + t.Errorf("sample sum incorrect: got %v, want %v", got, want) + } + if want, got := int64(min), snap.SizeMin; got != want { + t.Errorf("sample min incorrect: got %v, want %v", got, want) + } + if want, got := int64(max), snap.SizeMax; got != want { + t.Errorf("sample max incorrect: got %v, want %v", got, want) + } +} + +// BenchmarkPCollection measures the overhead of invoking a ParDo in a plan. +// +// On @lostluck's desktop (2020/02/20): +// BenchmarkPCollection-12 4469980624.8 ns/op 0 B/op 0 allocs/op +func BenchmarkPCollection(b *testing.B) { + // Pre allocate the capture buffer and process buffer to avoid + // unnecessary overhead. + out := &CaptureNode{UID: 1, Elements: make([]FullValue, 0, b.N)} + process := make([]MainInput, 0, b.N) + for i := 0; i < b.N; i++ { + process = append(process, MainInput{Key: FullValue{ + Windows: window.SingleGlobalWindow, + Timestamp: mtime.ZeroTimestamp, +
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=637008&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-637008 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 11/Aug/21 17:40 Start Date: 11/Aug/21 17:40 Worklog Time Spent: 10m Work Description: lostluck commented on a change in pull request #15289: URL: https://github.com/apache/beam/pull/15289#discussion_r687042489 ## File path: sdks/go/pkg/beam/core/runtime/exec/datasink.go ## @@ -38,7 +35,6 @@ type DataSink struct { wEnc WindowEncoder w io.WriteCloser count int64 Review comment: Ah! Good catch. Missed that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 637008) Time Spent: 4h 10m (was: 4h) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Assignee: Robert Burke >Priority: P3 > Time Spent: 4h 10m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=636675&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-636675 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 11/Aug/21 00:01 Start Date: 11/Aug/21 00:01 Worklog Time Spent: 10m Work Description: youngoli commented on a change in pull request #15289: URL: https://github.com/apache/beam/pull/15289#discussion_r685648169 ## File path: sdks/go/pkg/beam/core/runtime/exec/datasink.go ## @@ -38,7 +35,6 @@ type DataSink struct { wEnc WindowEncoder w io.WriteCloser count int64 Review comment: Do you still need `count` here now that you got rid of the lines using it? ## File path: sdks/go/pkg/beam/core/runtime/exec/pcollection_test.go ## @@ -0,0 +1,154 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +//http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package exec + +import ( + "context" + "math" + "testing" + + "github.com/apache/beam/sdks/go/pkg/beam/core/graph/coder" + "github.com/apache/beam/sdks/go/pkg/beam/core/graph/mtime" + "github.com/apache/beam/sdks/go/pkg/beam/core/graph/window" +) + +// TestPCollection verifies that the PCollection node works correctly. +// Seed is by default set to 0, so we have a "deterministic" set of +// randomness for the samples. +func TestPCollection(t *testing.T) { + a := &CaptureNode{UID: 1} + pcol := &PCollection{UID: 2, Out: a, Coder: coder.NewVarInt()} + // The "large" 2nd value is to ensure the values are encoded properly, + // and that Min & Max are behaving. + inputs := []interface{}{int64(1), int64(20), int64(3)} + in := &FixedRoot{UID: 3, Elements: makeInput(inputs...), Out: pcol} + + p, err := NewPlan("a", []Unit{a, pcol, in}) + if err != nil { + t.Fatalf("failed to construct plan: %v", err) + } + + if err := p.Execute(context.Background(), "1", DataContext{}); err != nil { + t.Fatalf("execute failed: %v", err) + } + if err := p.Down(context.Background()); err != nil { + t.Fatalf("down failed: %v", err) + } + + expected := makeValues(inputs...) + if !equalList(a.Elements, expected) { + t.Errorf("multiplex returned %v for a, want %v", extractValues(a.Elements...), extractValues(expected...)) + } + snap := pcol.snapshot() + if want, got := int64(len(expected)), snap.ElementCount; got != want { + t.Errorf("snapshot miscounted: got %v, want %v", got, want) + } + checkPCollectionSizeSample(t, snap, 3, 7, 1, 5) +} + +func TestPCollection_sizeReset(t *testing.T) { + // Check the initial values after resetting. + var pcol PCollection + pcol.resetSize() + snap := pcol.snapshot() + checkPCollectionSizeSample(t, snap, 0, 0, math.MaxInt64, math.MinInt64) +} + +func checkPCollectionSizeSample(t *testing.T, snap PCollectionSnapshot, count, sum, min, max int64) { + t.Helper() + if want, got := int64(count), snap.SizeCount; got != want { + t.Errorf("sample count incorrect: got %v, want %v", got, want) + } + if want, got := int64(sum), snap.SizeSum; got != want { + t.Errorf("sample sum incorrect: got %v, want %v", got, want) + } + if want, got := int64(min), snap.SizeMin; got != want { + t.Errorf("sample min incorrect: got %v, want %v", got, want) + } + if want, got := int64(max), snap.SizeMax; got != want { + t.Errorf("sample max incorrect: got %v, want %v", got, want) + } +} + +// BenchmarkPCollection measures the overhead of invoking a ParDo in a plan. +// +// On @lostluck's desktop (2020/02/20): +// BenchmarkPCollection-12 4469980624.8 ns/op 0 B/op 0 allocs/op +func BenchmarkPCollection(b *testing.B) { + // Pre allocate the capture buffer and process buffer to avoid + // unnecessary overhead. + out := &CaptureNode{UID: 1, Elements: make([]FullV
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=636098&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-636098 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 09/Aug/21 21:19 Start Date: 09/Aug/21 21:19 Worklog Time Spent: 10m Work Description: lostluck commented on pull request #15289: URL: https://github.com/apache/beam/pull/15289#issuecomment-895556871 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 636098) Time Spent: 3h 50m (was: 3h 40m) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Priority: P3 > Time Spent: 3h 50m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=636097&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-636097 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 09/Aug/21 21:17 Start Date: 09/Aug/21 21:17 Worklog Time Spent: 10m Work Description: lostluck commented on a change in pull request #15289: URL: https://github.com/apache/beam/pull/15289#discussion_r685527558 ## File path: sdks/go/pkg/beam/core/runtime/harness/monitoring.go ## @@ -163,38 +162,65 @@ func monitoring(p *exec.Plan) ([]*pipepb.MonitoringInfo, map[string][]byte) { }.ExtractFrom(store) // Get the execution monitoring information from the bundle plan. - if snapshot, ok := p.Progress(); ok { - payload, err := metricsx.Int64Counter(snapshot.Count) + + snapshot, ok := p.Progress() + if !ok { + return monitoringInfo, payloads + } + for _, pcol := range snapshot.PCols { + payload, err := metricsx.Int64Counter(pcol.ElementCount) if err != nil { panic(err) } // TODO(BEAM-9934): This metric should account for elements in multiple windows. - payloads[getShortID(metrics.PCollectionLabels(snapshot.PID), metricsx.UrnElementCount)] = payload + payloads[getShortID(metrics.PCollectionLabels(pcol.ID), metricsx.UrnElementCount)] = payload + monitoringInfo = append(monitoringInfo, &pipepb.MonitoringInfo{ Urn: metricsx.UrnToString(metricsx.UrnElementCount), Type: metricsx.UrnToType(metricsx.UrnElementCount), Labels: map[string]string{ - "PCOLLECTION": snapshot.PID, + "PCOLLECTION": pcol.ID, }, Payload: payload, }) - payloads[getShortID(metrics.PTransformLabels(snapshot.ID), metricsx.UrnDataChannelReadIndex)] = payload - monitoringInfo = append(monitoringInfo, - &pipepb.MonitoringInfo{ - Urn: metricsx.UrnToString(metricsx.UrnDataChannelReadIndex), - Type: metricsx.UrnToType(metricsx.UrnDataChannelReadIndex), - Labels: map[string]string{ - "PTRANSFORM": snapshot.ID, - }, - Payload: payload, - }) + // Skip pcollections without size + if pcol.SizeCount != 0 { + payload, err := metricsx.Int64Distribution(pcol.SizeCount, pcol.SizeSum, pcol.SizeMin, pcol.SizeMax) + if err != nil { + panic(err) + } + monitoringInfo = append(monitoringInfo, + &pipepb.MonitoringInfo{ + Urn: "beam:metric:sampled_byte_size:v1", + Type: "beam:metrics:distribution_int_64", Review comment: Good catch. Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 636097) Time Spent: 3h 40m (was: 3.5h) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Priority: P3 > Time Spent: 3h 40m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=636093&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-636093 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 09/Aug/21 21:10 Start Date: 09/Aug/21 21:10 Worklog Time Spent: 10m Work Description: lostluck commented on pull request #15289: URL: https://github.com/apache/beam/pull/15289#issuecomment-895551534 R: @youngoli @jrmccluskey @riteshghorse -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 636093) Time Spent: 3.5h (was: 3h 20m) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Priority: P3 > Time Spent: 3.5h > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=635982&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-635982 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 09/Aug/21 17:42 Start Date: 09/Aug/21 17:42 Worklog Time Spent: 10m Work Description: lostluck commented on a change in pull request #15289: URL: https://github.com/apache/beam/pull/15289#discussion_r685395050 ## File path: sdks/go/pkg/beam/core/runtime/exec/pcollection.go ## @@ -0,0 +1,153 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +//http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package exec + +import ( + "context" + "fmt" + "math" + "math/rand" + "sync" + "sync/atomic" + + "github.com/apache/beam/sdks/go/pkg/beam/core/graph/coder" +) + +// PCollection is a passthrough node to collect PCollection metrics, and +// must be placed as the Out node of any producer of a PCollection. +// +// In particular, must not be placed after a Multiplex, and must be placed +// after a Flatten. +type PCollection struct { Review comment: IIRC, either I'm adding an object or duplicating all the code. Right now this was simplest, vs modifying each of the other node kinds. The main optimization I can do later is concretely make the Out nodes the PCollection type which will help the compiler inline the method calls avoiding the function call overhead where possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 635982) Time Spent: 3h 20m (was: 3h 10m) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Priority: P3 > Time Spent: 3h 20m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=635976&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-635976 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 09/Aug/21 17:36 Start Date: 09/Aug/21 17:36 Worklog Time Spent: 10m Work Description: lostluck commented on a change in pull request #15289: URL: https://github.com/apache/beam/pull/15289#discussion_r685391106 ## File path: sdks/go/pkg/beam/core/runtime/exec/pcollection.go ## @@ -0,0 +1,153 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +//http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package exec + +import ( + "context" + "fmt" + "math" + "math/rand" + "sync" + "sync/atomic" + + "github.com/apache/beam/sdks/go/pkg/beam/core/graph/coder" +) + +// PCollection is a passthrough node to collect PCollection metrics, and +// must be placed as the Out node of any producer of a PCollection. +// +// In particular, must not be placed after a Multiplex, and must be placed +// after a Flatten. +type PCollection struct { Review comment: If you read the code in [ProcessElement](https://github.com/apache/beam/pull/15289/files/b3d4d1a90d0ccc36150a7ef50d497a0e793a3d9d#diff-a01d1e6315c7f0dc04c1148d3e17203dd25ff567b4f7c88a32b6e7cdc62e7920R83), we use the sampling technique I credited you for a year ago. I'm leaning towards that "always do the first 3 elements" thing, since I went with the random of first 3 three approach which fails on single element PCollections. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 635976) Time Spent: 3h 10m (was: 3h) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Priority: P3 > Time Spent: 3h 10m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=635973&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-635973 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 09/Aug/21 17:31 Start Date: 09/Aug/21 17:31 Worklog Time Spent: 10m Work Description: lostluck commented on a change in pull request #15289: URL: https://github.com/apache/beam/pull/15289#discussion_r685387873 ## File path: sdks/go/pkg/beam/core/runtime/harness/harness.go ## @@ -305,7 +313,7 @@ func (c *control) handleInstruction(ctx context.Context, req *fnpb.InstructionRe data.Close() state.Close() - mons, pylds := monitoring(plan) + mons, pylds := monitoring(plan, store) Review comment: I tried only having the payloads, but Dataflow doesn't produce the metrics at all then, so we'll keep the monitoring infos around until Dataflow handles only payloads properly. There might be an experiment to toggle to fix this, but there's no harm in the SDK waiting for the default to switch (other than the wire cost of not being exclusively on the short id requests & payload duplication). See https://console.cloud.google.com/dataflow/jobs/us-central1/2021-08-09_10_24_28-7240084760069835414?project=google.com:clouddfe which has both, vs https://console.cloud.google.com/dataflow/jobs/us-central1/2021-08-09_10_00_52-11618126016479762400?project=google.com:clouddfe which is just the payloads -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 635973) Time Spent: 3h (was: 2h 50m) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Priority: P3 > Time Spent: 3h > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=635899&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-635899 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 09/Aug/21 15:00 Start Date: 09/Aug/21 15:00 Worklog Time Spent: 10m Work Description: lostluck commented on a change in pull request #15289: URL: https://github.com/apache/beam/pull/15289#discussion_r685272069 ## File path: sdks/go/pkg/beam/core/runtime/harness/monitoring.go ## @@ -163,38 +162,65 @@ func monitoring(p *exec.Plan) ([]*pipepb.MonitoringInfo, map[string][]byte) { }.ExtractFrom(store) // Get the execution monitoring information from the bundle plan. - if snapshot, ok := p.Progress(); ok { - payload, err := metricsx.Int64Counter(snapshot.Count) + + snapshot, ok := p.Progress() + if !ok { + return monitoringInfo, payloads + } + for _, pcol := range snapshot.PCols { + payload, err := metricsx.Int64Counter(pcol.ElementCount) if err != nil { panic(err) } // TODO(BEAM-9934): This metric should account for elements in multiple windows. - payloads[getShortID(metrics.PCollectionLabels(snapshot.PID), metricsx.UrnElementCount)] = payload + payloads[getShortID(metrics.PCollectionLabels(pcol.ID), metricsx.UrnElementCount)] = payload + monitoringInfo = append(monitoringInfo, &pipepb.MonitoringInfo{ Urn: metricsx.UrnToString(metricsx.UrnElementCount), Type: metricsx.UrnToType(metricsx.UrnElementCount), Labels: map[string]string{ - "PCOLLECTION": snapshot.PID, + "PCOLLECTION": pcol.ID, }, Payload: payload, }) - payloads[getShortID(metrics.PTransformLabels(snapshot.ID), metricsx.UrnDataChannelReadIndex)] = payload - monitoringInfo = append(monitoringInfo, - &pipepb.MonitoringInfo{ - Urn: metricsx.UrnToString(metricsx.UrnDataChannelReadIndex), - Type: metricsx.UrnToType(metricsx.UrnDataChannelReadIndex), - Labels: map[string]string{ - "PTRANSFORM": snapshot.ID, - }, - Payload: payload, - }) + // Skip pcollections without size + if pcol.SizeCount != 0 { + payload, err := metricsx.Int64Distribution(pcol.SizeCount, pcol.SizeSum, pcol.SizeMin, pcol.SizeMax) + if err != nil { + panic(err) + } + monitoringInfo = append(monitoringInfo, + &pipepb.MonitoringInfo{ + Urn: "beam:metric:sampled_byte_size:v1", + Type: "beam:metrics:distribution_int_64", + Labels: map[string]string{ + "PCOLLECTION": pcol.ID, Review comment: That's what those IDs are. (They're taken from the bundle graph and populated in [exec/translate.go](https://github.com/apache/beam/pull/15289/files/b3d4d1a90d0ccc36150a7ef50d497a0e793a3d9d#diff-5f1959624a48fea7877e2dd71f648ea3fe8029b75a84c0ac6f5220888176464aR315) ) Good to validate beyond "it works on Dataflow" :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 635899) Time Spent: 2h 50m (was: 2h 40m) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Priority: P3 > Time Spent: 2h 50m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > e
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=635361&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-635361 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 06/Aug/21 16:22 Start Date: 06/Aug/21 16:22 Worklog Time Spent: 10m Work Description: lostluck commented on pull request #15289: URL: https://github.com/apache/beam/pull/15289#issuecomment-894371204 @ajamato You got pinged because you were cc'd on the first version of this PR, which I linked at the top of the description. Thank you for pointing out some areas I can probably clean up and update some of the documentation due to this drift. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 635361) Time Spent: 2h 40m (was: 2.5h) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Priority: P3 > Time Spent: 2h 40m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=634935&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-634935 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 06/Aug/21 03:12 Start Date: 06/Aug/21 03:12 Worklog Time Spent: 10m Work Description: lostluck commented on pull request #15289: URL: https://github.com/apache/beam/pull/15289#issuecomment-893969502 @ajamato i must have accidentally clicked the suggested reviewer button, but I didn't tag you for review. Much appreciated though, thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 634935) Time Spent: 2.5h (was: 2h 20m) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Priority: P3 > Time Spent: 2.5h > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=634902&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-634902 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 06/Aug/21 01:02 Start Date: 06/Aug/21 01:02 Worklog Time Spent: 10m Work Description: ajamato commented on a change in pull request #15289: URL: https://github.com/apache/beam/pull/15289#discussion_r683867941 ## File path: sdks/go/pkg/beam/core/runtime/harness/monitoring.go ## @@ -163,38 +162,65 @@ func monitoring(p *exec.Plan) ([]*pipepb.MonitoringInfo, map[string][]byte) { }.ExtractFrom(store) // Get the execution monitoring information from the bundle plan. - if snapshot, ok := p.Progress(); ok { - payload, err := metricsx.Int64Counter(snapshot.Count) + + snapshot, ok := p.Progress() + if !ok { + return monitoringInfo, payloads + } + for _, pcol := range snapshot.PCols { + payload, err := metricsx.Int64Counter(pcol.ElementCount) if err != nil { panic(err) } // TODO(BEAM-9934): This metric should account for elements in multiple windows. - payloads[getShortID(metrics.PCollectionLabels(snapshot.PID), metricsx.UrnElementCount)] = payload + payloads[getShortID(metrics.PCollectionLabels(pcol.ID), metricsx.UrnElementCount)] = payload + monitoringInfo = append(monitoringInfo, &pipepb.MonitoringInfo{ Urn: metricsx.UrnToString(metricsx.UrnElementCount), Type: metricsx.UrnToType(metricsx.UrnElementCount), Labels: map[string]string{ - "PCOLLECTION": snapshot.PID, + "PCOLLECTION": pcol.ID, }, Payload: payload, }) - payloads[getShortID(metrics.PTransformLabels(snapshot.ID), metricsx.UrnDataChannelReadIndex)] = payload - monitoringInfo = append(monitoringInfo, - &pipepb.MonitoringInfo{ - Urn: metricsx.UrnToString(metricsx.UrnDataChannelReadIndex), - Type: metricsx.UrnToType(metricsx.UrnDataChannelReadIndex), - Labels: map[string]string{ - "PTRANSFORM": snapshot.ID, - }, - Payload: payload, - }) + // Skip pcollections without size + if pcol.SizeCount != 0 { + payload, err := metricsx.Int64Distribution(pcol.SizeCount, pcol.SizeSum, pcol.SizeMin, pcol.SizeMax) + if err != nil { + panic(err) + } + monitoringInfo = append(monitoringInfo, + &pipepb.MonitoringInfo{ + Urn: "beam:metric:sampled_byte_size:v1", + Type: "beam:metrics:distribution_int_64", + Labels: map[string]string{ + "PCOLLECTION": pcol.ID, Review comment: This must be the exact same string as the pcollection id passed into the ProcessBundleDescriptor https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/model/fn-execution/src/main/proto/beam_fn_api.proto#L198 ## File path: sdks/go/pkg/beam/core/runtime/harness/monitoring.go ## @@ -163,38 +162,65 @@ func monitoring(p *exec.Plan) ([]*pipepb.MonitoringInfo, map[string][]byte) { }.ExtractFrom(store) // Get the execution monitoring information from the bundle plan. - if snapshot, ok := p.Progress(); ok { - payload, err := metricsx.Int64Counter(snapshot.Count) + + snapshot, ok := p.Progress() + if !ok { + return monitoringInfo, payloads + } + for _, pcol := range snapshot.PCols { + payload, err := metricsx.Int64Counter(pcol.ElementCount) if err != nil { panic(err) } // TODO(BEAM-9934): This metric should account for elements in multiple windows. - payloads[getShortID(metrics.PCollectionLabels(snapshot.PID), metricsx.UrnElementCount)] = payload + payloads[getShortID(metrics.PCollectionLabels(pcol.ID), metricsx.UrnElementCount)] = payload + monitoringInfo = append(monitoringInfo, &pipepb.Monit
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=634888&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-634888 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 06/Aug/21 00:12 Start Date: 06/Aug/21 00:12 Worklog Time Spent: 10m Work Description: lostluck commented on pull request #15289: URL: https://github.com/apache/beam/pull/15289#issuecomment-893907683 R: @youngoli @jrmccluskey @riteshghorse -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 634888) Time Spent: 2h 10m (was: 2h) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Priority: P3 > Time Spent: 2h 10m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=634847&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-634847 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 05/Aug/21 22:44 Start Date: 05/Aug/21 22:44 Worklog Time Spent: 10m Work Description: lostluck commented on pull request #15289: URL: https://github.com/apache/beam/pull/15289#issuecomment-893867957 Run Go Samza ValidatesRunner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 634847) Time Spent: 1h 50m (was: 1h 40m) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Priority: P3 > Time Spent: 1h 50m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=634848&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-634848 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 05/Aug/21 22:44 Start Date: 05/Aug/21 22:44 Worklog Time Spent: 10m Work Description: lostluck commented on pull request #15289: URL: https://github.com/apache/beam/pull/15289#issuecomment-893868018 Run Go Spark ValidatesRunner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 634848) Time Spent: 2h (was: 1h 50m) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Priority: P3 > Time Spent: 2h > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=634846&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-634846 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 05/Aug/21 22:44 Start Date: 05/Aug/21 22:44 Worklog Time Spent: 10m Work Description: lostluck commented on pull request #15289: URL: https://github.com/apache/beam/pull/15289#issuecomment-893867804 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 634846) Time Spent: 1h 40m (was: 1.5h) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Priority: P3 > Time Spent: 1h 40m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=634830&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-634830 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 05/Aug/21 22:03 Start Date: 05/Aug/21 22:03 Worklog Time Spent: 10m Work Description: lostluck opened a new pull request #15289: URL: https://github.com/apache/beam/pull/15289 Restoring https://github.com/apache/beam/pull/10942 to narrow down where the post submits failed previously. --- This adds PCollection metrics to the Go SDK, in particular, Element Count, and Sampled Size. New exec.PCollection nodes are added between every processing node in the bundle execution graph. * The new metrics are only added as MonitoringInfos, not the legacy protos. * There's about ~10ns added per element per PCollection node due to the atomic additions for every element. * Elements for sizes are selected randomly, then encoded to count their bytes (w/o window headers). * An initial index is selected form the first [0,1,2] at bundle start up, and then pre-select the next index from somewhere later on, proportional to the bundle so far. * As currently set up, it will take around 200-300 samples for the first 1M elements, so encoded overhead is limited * PCollections from a DataSource do 100% "sampling", since they're reading the bytes directly anyway. The PCollection node that would have been added after the DataSource is elided from the graph during construction, but re-used to avoid duplicating the logic for concurrently manipulating the size distribution. * DataSources can properly handle CoGBKs as well, counting non-header bytes for iterables, and state backed iterables. * This still involves a mutex Lock for every update, so we may want to find a lighter weight mechanism to handle the distribution samples from DataSources, or simply opt for the same random sampling. * A similar method could be used for DataSinks as well, but not handled in this PR. * It's important to note that the runner is already aware of the number of bytes sent and received from the SDK side, so we may opt to remove that this entirely. * Counts and Samples are yet not made for SideInputs, which would better account for data consumed by DoFns. Thank you @ajamato for reminding me of the pre-select method for sampling, and @lukecwik for pointing out the DataSource can avoid separate additional encoding costs when measuring elements. Performance impact: I have two jobs I use for benchmarking this: Pipeline A uses int64s as elements and does simple passthroughs and sums, and Pipeline B where it's using large protocol buffers as elements, which spends a fair amount of CPU time decoding them. For small "fast" elements, the overhead is about ~19.5% of the Go side processing (which makes sense if elements are just being passed around or incremented). For large "heavy" elements, the overhead is about ~0.125% of the Go side of processing. Specifically, this is only taking into account the Go SDK worker, and not any runner side costs. This feels acceptable for the time being, though it's possible we can improve this later, especially for "lighter" jobs. Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] Update `CHANGES.md` with noteworthy changes. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier). `ValidatesRunner` compliance status (on master branch) Lang ULR Dataflow Flink Samza Spark Twister2 Go --- https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/";> https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon";> https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/";> https://ci-bea
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=398804&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-398804 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 06/Mar/20 00:49 Start Date: 06/Mar/20 00:49 Worklog Time Spent: 10m Work Description: lostluck commented on issue #11061: Revert "[BEAM-6374] Emit PCollection metrics from GoSDK" URL: https://github.com/apache/beam/pull/11061#issuecomment-595519768 Run Go Postcommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 398804) Time Spent: 1h 20m (was: 1h 10m) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Assignee: Robert Burke >Priority: Major > Time Spent: 1h 20m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=398801&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-398801 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 06/Mar/20 00:48 Start Date: 06/Mar/20 00:48 Worklog Time Spent: 10m Work Description: lostluck commented on pull request #11061: Revert "[BEAM-6374] Emit PCollection metrics from GoSDK" URL: https://github.com/apache/beam/pull/11061 Reverts apache/beam#10942 Seems to be breaking the post commit. Since I'm going on vacation tonight, I'm rolling to back, and will look into it when I get back. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 398801) Time Spent: 1h 10m (was: 1h) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Assignee: Robert Burke >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=397885&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-397885 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 04/Mar/20 21:03 Start Date: 04/Mar/20 21:03 Worklog Time Spent: 10m Work Description: lostluck commented on pull request #10942: [BEAM-6374] Emit PCollection metrics from GoSDK URL: https://github.com/apache/beam/pull/10942 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 397885) Time Spent: 1h (was: 50m) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Assignee: Robert Burke >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=393015&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-393015 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 25/Feb/20 22:36 Start Date: 25/Feb/20 22:36 Worklog Time Spent: 10m Work Description: lostluck commented on pull request #10942: [BEAM-6374] Emit PCollection metrics from GoSDK URL: https://github.com/apache/beam/pull/10942#discussion_r384168998 ## File path: sdks/go/pkg/beam/core/runtime/exec/datasource_test.go ## @@ -201,6 +201,16 @@ func TestDataSource_Iterators(t *testing.T) { if got, want := iVals, expectedKeys; !equalList(got, want) { t.Errorf("DataSource => %#v, want %#v", extractValues(got...), extractValues(want...)) } + + // We're using small ints, so do some quick math to validate. + sizeOfSmallInt := 1 Review comment: You're right that my wording is ambiguous. I literally mean small ints as in 0-127 which are definitely encoded as a single byte with varint64 encoding. If we use larger integers (as one of the PCollection tests do), the encoded size is larger. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 393015) Time Spent: 50m (was: 40m) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Assignee: Robert Burke >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=392928&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-392928 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 25/Feb/20 21:09 Start Date: 25/Feb/20 21:09 Worklog Time Spent: 10m Work Description: youngoli commented on pull request #10942: [BEAM-6374] Emit PCollection metrics from GoSDK URL: https://github.com/apache/beam/pull/10942#discussion_r383583118 ## File path: sdks/go/pkg/beam/core/runtime/exec/plan.go ## @@ -178,19 +160,25 @@ func (p *Plan) String() string { return fmt.Sprintf("Plan[%v]:\n%v", p.ID(), strings.Join(units, "\n")) } -// Progress returns a snapshot of input progress of the plan, and associated metrics. -func (p *Plan) Progress() (ProgressReportSnapshot, bool) { - if p.source != nil { - return p.source.Progress(), true - } - return ProgressReportSnapshot{}, false +// PlanSnapshot contains system metrics for the current run of the plan. +type PlanSnapshot struct { + Source ProgressReportSnapshot + PCols []PCollectionSnapshot } -// Store returns the metric store for the last use of this plan. -func (p *Plan) Store() *metrics.Store { - p.storeMu.Lock() - defer p.storeMu.Unlock() - return p.store +// Progress returns a snapshot of progress of the plan, and associated metrics. Review comment: This comment should probably be updated to explain that the bool returned represents whether the snapshot has a DataSource, as opposed to the usual assumption of an "ok" value. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 392928) Time Spent: 0.5h (was: 20m) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Assignee: Robert Burke >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=392929&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-392929 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 25/Feb/20 21:09 Start Date: 25/Feb/20 21:09 Worklog Time Spent: 10m Work Description: youngoli commented on pull request #10942: [BEAM-6374] Emit PCollection metrics from GoSDK URL: https://github.com/apache/beam/pull/10942#discussion_r384070704 ## File path: sdks/go/pkg/beam/core/runtime/exec/datasource_test.go ## @@ -201,6 +201,16 @@ func TestDataSource_Iterators(t *testing.T) { if got, want := iVals, expectedKeys; !equalList(got, want) { t.Errorf("DataSource => %#v, want %#v", extractValues(got...), extractValues(want...)) } + + // We're using small ints, so do some quick math to validate. + sizeOfSmallInt := 1 Review comment: Just checking, we use the size of a small int instead of an int64 because the keys are encoded as small ints, right? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 392929) Time Spent: 40m (was: 0.5h) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Assignee: Robert Burke >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=391291&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-391291 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 23/Feb/20 04:56 Start Date: 23/Feb/20 04:56 Worklog Time Spent: 10m Work Description: lostluck commented on pull request #10942: [BEAM-6374] Emit PCollection metrics from GoSDK URL: https://github.com/apache/beam/pull/10942 This adds PCollection metrics to the Go SDK, in particular, Element Count, and Sampled Size. New exec.PCollection nodes are added between every processing node in the bundle execution graph. * The new metrics are only added as MonitoringInfos, not the legacy protos. * There's about ~10ns added per element per PCollection node due to the atomic additions for every element. * Elements for sizes are selected randomly, then encoded to count their bytes (w/o window headers). * An initial index is selected form the first [0,1,2] at bundle start up, and then pre-select the next index from somewhere later on, proportional to the bundle so far. * As currently set up, it will take around 200-300 samples for the first 1M elements, so encoded overhead is limited * PCollections from a DataSource do 100% "sampling", since they're reading the bytes directly anyway. The PCollection node that would have been added after the DataSource is elided from the graph during construction, but re-used to avoid duplicating the logic for concurrently manipulating the size distribution. * DataSources can properly handle CoGBKs as well, counting non-header bytes for iterables, and state backed iterables. * This still involves a mutex Lock for every update, so we may want to find a lighter weight mechanism to handle the distribution samples from DataSources, or simply opt for the same random sampling. * A similar method could be used for DataSinks as well, but not handled in this PR. * It's important to note that the runner is already aware of the number of bytes sent and received from the SDK side, so we may opt to remove that this entirely. * Counts and Samples are yet not made for SideInputs, which would better account for data consumed by DoFns. Thank you @ajamato for reminding me of the pre-select method for sampling, and @lukecwik for pointing out the DataSource can avoid separate additional encoding costs when measuring elements. Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] Update `CHANGES.md` with noteworthy changes. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier). Post-Commit Tests Status (on master branch) Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark --- | --- | --- | --- | --- | --- | --- | --- Go | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/) Java | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)
[jira] [Work logged] (BEAM-6374) "elements added" for input and output collections is always empty
[ https://issues.apache.org/jira/browse/BEAM-6374?focusedWorklogId=391292&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-391292 ] ASF GitHub Bot logged work on BEAM-6374: Author: ASF GitHub Bot Created on: 23/Feb/20 04:56 Start Date: 23/Feb/20 04:56 Worklog Time Spent: 10m Work Description: lostluck commented on issue #10942: [BEAM-6374] Emit PCollection metrics from GoSDK URL: https://github.com/apache/beam/pull/10942#issuecomment-590027126 R: @youngoli cc: @ajamato @lukecwik This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 391292) Time Spent: 20m (was: 10m) > "elements added" for input and output collections is always empty > - > > Key: BEAM-6374 > URL: https://issues.apache.org/jira/browse/BEAM-6374 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-go >Reporter: Andrew Brampton >Assignee: Robert Burke >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > The field for "Elements added" and "Estimated size" is always blank when > running a Go binary on Dataflow. For example when running the work count > example: https://pasteboard.co/HVf80BU.png -- This message was sent by Atlassian Jira (v8.3.4#803005)