pritamdodeja opened a new issue, #35573:
URL: https://github.com/apache/beam/issues/35573

   ### What would you like to happen?
   
   Creating datasets from IoT data that is bounded and already on disk is 
difficult with the Python SDK.  Generally, the data is sequence data in 
key-value format, and the desired shape is equally spaced in time and in wide 
format (i.e. indexed by time, with a feature axis).
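   
   To make the desired shape concrete, here is a minimal single-process 
sketch in plain Python. It is not a Beam transform and not an existing API: the 
name `to_wide_regular`, the `(timestamp, key, value)` record layout, and the 
forward-fill policy are all assumptions chosen for illustration.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical record layout: (timestamp_seconds, sensor_key, value).
Record = Tuple[int, str, float]
Row = Tuple[int, Dict[str, Optional[float]]]


def to_wide_regular(records: Iterable[Record], step: int) -> List[Row]:
    """Resample long-format records onto a regular grid, forward-filling gaps."""
    ordered = sorted(records)  # order by timestamp, then key
    if not ordered:
        return []
    keys = sorted({key for _, key, _ in ordered})
    start, end = ordered[0][0], ordered[-1][0]
    # Last known value per key; None until the first reading arrives.
    last: Dict[str, Optional[float]] = {key: None for key in keys}
    rows: List[Row] = []
    i = 0
    for t in range(start, end + 1, step):
        # Consume every reading at or before grid time t.
        while i < len(ordered) and ordered[i][0] <= t:
            _, key, value = ordered[i]
            last[key] = value
            i += 1
        rows.append((t, dict(last)))
    # Readings after the final grid point are not emitted in this sketch.
    return rows


records = [(0, "temp", 20.0), (0, "hum", 0.5), (25, "temp", 21.0), (30, "hum", 0.6)]
rows = to_wide_regular(records, step=10)
```

   The sketch relies on seeing all records for a series in time order, which 
is the part that is hard to guarantee once the work is distributed.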
   
   Once the data is smoothed, there is other processing that needs to be 
done, depending on the problem objective (e.g. add a sequence dimension, or 
produce (X, y) tuples, possibly with a time gap between the two). When the 
data is already on disk and contains large gaps, using state to capture the 
last known value does not help, because the computation is parallelized.
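   
   As an illustration of the X, y step, the following plain-Python sketch 
slides a window over rows that are already equally spaced. The function name 
`make_xy` and its parameters `seq_len`, `gap`, and `horizon` are hypothetical 
and only meant to show what such a transform might expose.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def make_xy(
    rows: Sequence[T], seq_len: int, gap: int, horizon: int = 1
) -> List[Tuple[List[T], List[T]]]:
    """Build (X, y) pairs: seq_len input rows, then gap skipped rows, then horizon target rows."""
    span = seq_len + gap + horizon
    pairs: List[Tuple[List[T], List[T]]] = []
    # If there are fewer rows than one full span, the range is empty.
    for start in range(len(rows) - span + 1):
        x = list(rows[start:start + seq_len])
        y_start = start + seq_len + gap
        y = list(rows[y_start:y_start + horizon])
        pairs.append((x, y))
    return pairs


pairs = make_xy(list(range(6)), seq_len=3, gap=1)
```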
   
   It would be great to have time-series transforms in the Python SDK that 
make it possible to produce smoothed versions of the data, possibly with other 
parameters (e.g. sampling frequency, time gap, labels).
   
   In the absence of this feature, users wishing to use Apache Beam to produce 
large-scale ML datasets are left to deal with complex windowing and other 
aspects on their own.  Given that this data is usually quite large, distributed 
processing on Dataflow (for my particular scenario) is the only way to produce 
these datasets, so there are no other viable options currently.
   
   Some of these capabilities (smoothing, splitting into X and y) are likely 
applicable in both batch and online settings.
   
   ### Issue Priority
   
   Priority: 2 (default / most feature requests should be filed as P2)
   
   ### Issue Components
   
   - [x] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Infrastructure
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [x] Component: Google Cloud Dataflow Runner

