Re: [YAML] add timestamp to a bounded PCollection
Just created https://github.com/apache/beam/pull/29969

On Mon, Jan 8, 2024 at 2:49 PM Robert Bradshaw wrote:
>
> This does appear to be a significant missing feature. I'll try to make
> sure something easier gets in by the next release. See also below.
>
> On Mon, Jan 8, 2024 at 11:30 AM Ferran Fernández Garrido wrote:
> >
> > Hi Yarden,
> >
> > Since it's a bounded source, you could try a Sql transform grouping by
> > the timestamp column. Here are some examples of grouping:
> >
> > https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml
> >
> > However, if you want to add a timestamp column in addition to the
> > original CSV records, there are multiple ways to achieve that.
> >
> > 1) MapToFields:
> > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/yaml_mapping.md
> > [Your timestamp column could be a callable that gets the current
> > timestamp for each record.]
> >
> > 2) If you need an extra layer of transformation complexity, I would
> > recommend creating a custom transform:
> >
> >     # - type: MyCustomTransform
> >     #   name: AddDateTimeColumn
> >     #   config:
> >     #     prefix: 'whatever'
> >
> >     providers:
> >       - type: 'javaJar'
> >         config:
> >           jar: 'gs://path/of/the/java.jar'
> >         transforms:
> >           MyCustomTransform: 'beam:transform:org.apache.beam:javatransformation:v1'
>
> Alternatively, you can use PyTransform, if you're more comfortable with
> that, by invoking it via its fully qualified name:
>
>     pipeline:
>       transforms:
>         ...
>         - type: MyAssignTimestamps
>           config:
>             kwarg1: ...
>             kwarg2: ...
>
>     providers:
>       - type: python
>         config:
>           packages: ['py_py_package_identifier']
>         transforms:
>           MyAssignTimestamps: 'fully_qualified_package.module.AssignTimestampsPTransform'
>
> > Best,
> > Ferran
> >
> > On Mon, Jan 8, 2024 at 19:53, Yarden BenMoshe () wrote:
> > >
> > > Hi all,
> > > I'm quite new to using Beam YAML. I am working with a CSV file and
> > > want to implement some windowing logic on it.
> > > I was wondering what the right way is to add timestamps to each
> > > element, assuming I have a column containing a timestamp.
> > >
> > > I am aware of the relevant part of the Beam Programming Guide
> > > (apache.org), but I'm not sure how this can be implemented and used
> > > from the YAML perspective.
> > >
> > > Thanks,
> > > Yarden
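[Editor's note] The suggestions in this thread can be combined into a single sketch. This is illustrative only: `MyAssignTimestamps`, the package name `my_timestamp_pkg`, and the GCS path are hypothetical placeholders, and the Python transform behind the fully qualified name would still need to be written (e.g. a `PTransform` that emits `TimestampedValue`s from the chosen column):

```
# Hypothetical sketch; transform, package, and path names are placeholders.
pipeline:
  transforms:
    - type: ReadFromCsv
      config:
        path: 'gs://my-bucket/events.csv'
    - type: MyAssignTimestamps        # custom Python transform, provided below
      input: ReadFromCsv
      config:
        timestamp_column: event_time
    - type: WindowInto
      input: MyAssignTimestamps
      config:
        windowing:
          type: fixed
          size: 60s

providers:
  - type: python
    config:
      packages: ['my_timestamp_pkg']
    transforms:
      MyAssignTimestamps: 'my_timestamp_pkg.transforms.AssignTimestampsPTransform'
```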
[Question] Apache Beam Pipeline AWS Credentials
Hello,

I am a developer trying to use Apache Beam in Java, and I am having an issue with my AWS credentials expiring before my pipeline is done executing. I'm limited in how I can get my AWS credentials, as I set them using the StaticCredentialsProvider class. I set the credentials (an access key, secret access key, and session token) in the PipelineOptions object before I create the pipeline. However, I believe the PipelineOptions object cannot be modified during runtime, so I'm not sure how to refresh my credentials so the pipeline doesn't break during execution.

Some code for how I set my credentials, for reference:

    Credentials awsCreds = ;
    AwsCredentials credentials =
        AwsSessionCredentials.create(
            awsCreds.getAccessKeyId(),
            awsCreds.getSecretAccessKey(),
            awsCreds.getSessionToken());
    AwsCredentialsProvider provider = StaticCredentialsProvider.create(credentials);
    pipelineOptions.as(AwsOptions.class).setAwsCredentialsProvider(provider);

Any help would be appreciated!

Thanks and sincerely,
Ramya
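[Editor's note] One common pattern for this problem (not stated in the thread, so treat it as a suggestion) is to register a custom AwsCredentialsProvider whose resolveCredentials() re-fetches session credentials as they near expiry, rather than a StaticCredentialsProvider holding credentials captured at submission time. The core refresh-ahead logic can be sketched dependency-free; `RefreshingSupplier` and its names are illustrative, and a real provider would delegate resolveCredentials() to get(), with the loader calling STS (or whatever issues the session token):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Sketch of a refresh-ahead cache: re-invokes the loader once the cached
// value's time-to-live has elapsed. A custom AwsCredentialsProvider could
// delegate resolveCredentials() to get(), with the loader fetching fresh
// AwsSessionCredentials. Class and member names are illustrative only.
final class RefreshingSupplier<T> implements Supplier<T> {
  private final Supplier<T> loader; // e.g. fetches new session credentials
  private final Duration ttl;       // kept shorter than the token lifetime
  private T cached;
  private Instant expiresAt = Instant.MIN;

  RefreshingSupplier(Supplier<T> loader, Duration ttl) {
    this.loader = loader;
    this.ttl = ttl;
  }

  @Override
  public synchronized T get() {
    if (!Instant.now().isBefore(expiresAt)) {
      cached = loader.get();            // refresh before the old value expires
      expiresAt = Instant.now().plus(ttl);
    }
    return cached;
  }
}
```

Note that on a distributed runner the configured provider is serialized out to workers, so a long-running pipeline is generally better served by a provider type that can re-derive credentials on the worker (for example an STS assume-role provider) than by fixed session credentials baked in when the pipeline is built.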
Beam High Priority Issue Report (51)
This is your daily summary of Beam's current high priority issues that may need attention. See https://beam.apache.org/contribute/issue-priorities for the meaning and expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/29926 [Bug]: FileIO: lack of timeouts may cause the pipeline to get stuck indefinitely
https://github.com/apache/beam/issues/29912 [Bug]: floatValueExtractor judge float and double equality directly
https://github.com/apache/beam/issues/29825 [Bug]: Usage of logical types breaks Beam YAML Sql
https://github.com/apache/beam/issues/29413 [Bug]: Can not use Avro over 1.8.2 with Beam 2.52.0
https://github.com/apache/beam/issues/29099 [Bug]: FnAPI Java SDK Harness doesn't update user counters in OnTimer callback functions
https://github.com/apache/beam/issues/29022 [Failing Test]: Python Github actions tests are failing due to update of pip
https://github.com/apache/beam/issues/28760 [Bug]: EFO Kinesis IO reader provided by apache beam does not pick the event time for watermarking
https://github.com/apache/beam/issues/28715 [Bug]: Python WriteToBigtable get stuck for large jobs due to client dead lock
https://github.com/apache/beam/issues/28383 [Failing Test]: org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testMaxThreadMetric
https://github.com/apache/beam/issues/28339 Fix failing "beam_PostCommit_XVR_GoUsingJava_Dataflow" job
https://github.com/apache/beam/issues/28326 Bug: apache_beam.io.gcp.pubsublite.ReadFromPubSubLite not working
https://github.com/apache/beam/issues/28142 [Bug]: [Go SDK] Memory seems to be leaking on 2.49.0 with Dataflow
https://github.com/apache/beam/issues/27892 [Bug]: ignoreUnknownValues not working when using CreateDisposition.CREATE_IF_NEEDED
https://github.com/apache/beam/issues/27648 [Bug]: Python SDFs (e.g. PeriodicImpulse) running in Flink and polling using tracker.defer_remainder have checkpoint size growing indefinitely
https://github.com/apache/beam/issues/27616 [Bug]: Unable to use applyRowMutations() in bigquery IO apache beam java
https://github.com/apache/beam/issues/27486 [Bug]: Read from datastore with inequality filters
https://github.com/apache/beam/issues/27314 [Failing Test]: bigquery.StorageApiSinkCreateIfNeededIT.testCreateManyTables[1]
https://github.com/apache/beam/issues/27238 [Bug]: Window trigger has lag when using Kafka and GroupByKey on Dataflow Runner
https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested ROW (described below)
https://github.com/apache/beam/issues/26343 [Bug]: apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944 beam_PreCommit_Python_Cron regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder will drop message id and orderingKey
https://github.com/apache/beam/issues/22913 [Bug]: beam_PostCommit_Java_ValidatesRunner_Flink is flakes in org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit test action StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial (order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) failing: ParDoTest$TimestampTests/OnWindowExpirationTests