pritamdodeja opened a new issue, #35573: URL: https://github.com/apache/beam/issues/35573
### What would you like to happen?

Creating datasets from IoT data that is bounded and already on disk is difficult with the Python SDK. The data is generally sequence data in key-value format, and the desired shape is equally spaced in time and in wide format (i.e. indexed by time, with a feature axis). Once the data is smoothed, further processing is needed depending on the problem objective (e.g. adding a sequence dimension, or producing (X, y) tuples, possibly with a time gap between the two).

When the data is already on disk and contains large gaps, using state to capture the last known value does not help, because the computation is parallelized.

It would be great to have time-series transforms in the Python SDK that make it possible to produce smoothed versions of the data, possibly with other parameters (e.g. sampling frequency, time gap, labels). Without this feature, users who want to use Apache Beam to produce large-scale ML datasets are left to deal with complex windowing and other aspects on their own. Because this data is usually quite large, distributed processing on Dataflow (in my particular scenario) is the only way to produce these datasets, so there are currently no other viable options.

Some of these capabilities (smoothing, splitting into X and y) are likely applicable in both batch and online settings.
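To make the request concrete, here is a minimal pure-Python sketch of the two steps described above: resampling key-value readings onto an equally spaced, wide-format grid using the last known value, and then cutting the grid into (X, y) pairs with a time gap. The names `resample_wide` and `make_xy_pairs`, the `(timestamp, feature, value)` record layout, and the forward-fill rule are all assumptions made for illustration; none of this is an existing Beam API, and it only works when one device's readings fit in memory.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical record layout: (timestamp_seconds, feature_name, value).
Reading = Tuple[float, str, float]
Row = Tuple[float, List[Optional[float]]]


def resample_wide(readings: Sequence[Reading], features: Sequence[str],
                  start: float, end: float, step: float) -> List[Row]:
    """Forward-fill readings onto a regular time grid, one column per feature."""
    by_feature: Dict[str, List[Tuple[float, float]]] = {f: [] for f in features}
    for ts, name, value in sorted(readings):
        if name in by_feature:
            by_feature[name].append((ts, value))

    rows: List[Row] = []
    n_steps = int((end - start) // step) + 1
    for i in range(n_steps):
        t = start + i * step
        row: List[Optional[float]] = []
        for f in features:
            series = by_feature[f]
            # Index just past the last reading with timestamp <= t.
            idx = bisect_right(series, (t, float("inf")))
            # None marks grid points before the first reading of a feature.
            row.append(series[idx - 1][1] if idx else None)
        rows.append((t, row))
    return rows


def make_xy_pairs(rows: Sequence[Row], seq_len: int, gap: int):
    """Build (X, y) pairs: X is seq_len consecutive rows, y is the row
    `gap` steps after the last row of X."""
    values = [row for _, row in rows]
    pairs = []
    for i in range(len(values) - seq_len - gap + 1):
        x = values[i:i + seq_len]
        y = values[i + seq_len - 1 + gap]
        pairs.append((x, y))
    return pairs
```

In a pipeline, functions like these could in principle be applied per device key after a `GroupByKey`, which is exactly the hand-rolled work this issue asks the SDK to take over, including the case where a single key is too large to process in one place.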
### Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

### Issue Components

- [x] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Infrastructure
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [x] Component: Google Cloud Dataflow Runner