CoGian opened a new issue, #27486:
URL: https://github.com/apache/beam/issues/27486

   ### What happened?
   
   Apache Beam: 2.44.0
   Runners: DataflowRunner, DirectRunner
   
   I have this pipeline which reads events from datastore and I want to add a 
filter to the query. 
   ```python
           with Pipeline(options=self.options) as p:
               read_event_query = Query(
                   kind=self.configs.datastore.events_kind,
                   project=self.configs.cloud.project_id,
                   namespace=self.configs.datastore.namespace,
               )
               start_timestamp = int(round(time.time() * 1000) - 6.048e+8)
               read_event_query.filters = [('timestamp' , '>', start_timestamp)]
   
               events = (
                       p
                       | 'Read from datastore' >> 
ReadFromDatastore(query=read_event_query, num_splits=1)
               )
   
               _ = (events
                    | Map(print))
   ```
   
   It outputs this error: 
   ```
   2023-07-13 12:12:27,713: INFO: Unable to parallelize the given query: 
<Query(kind=events, project=atypon-partnersolutions, namespace=non-production, 
ancestor=None, filters=[('timestamp', '>', 1688634747019)],projection=(), 
order=(), distinct_on=(), limit=None)>
   Traceback (most recent call last):
     File 
"/home/kgiantsios/miniconda3/envs/*****/lib/python3.10/site-packages/apache_beam/io/gcp/datastore/v1new/datastoreio.py",
 line 184, in process
       query_splitter.validate_split(query)
     File 
"/home/kgiantsios/miniconda3/envs/****/lib/python3.10/site-packages/apache_beam/io/gcp/datastore/v1new/query_splitter.py",
 line 109, in validate_split
       raise SplitNotPossibleError('Query cannot have any inequality filters.')
   apache_beam.io.gcp.datastore.v1new.query_splitter.SplitNotPossibleError: 
Query cannot have any inequality filters.
   
   ```
   
   Although in Docstring of ReadFromDatastore notes:
   ```
    However, when the `query` is configured with a `limit` or if the
     query contains inequality filters like `GREATER_THAN, LESS_THAN` etc., then
     all the returned results will be read by a single worker in order to ensure
     correct data. Since data is read from a single worker, this could have
     significant impact on the performance of the job.
   ```
   
   Shouldn't the operation proceed even with the limitation of the single 
worker? 
   
   
   ### Issue Priority
   
   Priority: 1 (data loss / total loss of function)
   
   ### Issue Components
   
   - [X] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [X] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [X] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to