lei-su-awx opened a new issue, #10315:
URL: https://github.com/apache/hudi/issues/10315

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   I have a table partitioned by operation type and ingestion date (e.g. insert/2023-12-11/, update/2023-12-11/, delete/2023-12-11/). When I read this table (using Spark readStream), I only want to read the data under the `update` partition. I found the config 'hoodie.datasource.read.incr.path.glob' and set it to `update/202*`. But the Spark job initializes very slowly, and I found it was stuck here:
   <img width="1534" alt="image" src="https://github.com/apache/hudi/assets/19327659/febba726-b441-4b01-abb0-2e12f8bc62d7">
   The parquet file it is stuck on is not under the `update` partition but under the `insert` partition, which is very confusing.
   So I want to ask: is there a config that reads only the target partition, skips the others, and also does not read other partitions' data files to infer the schema?
   
   
   **Expected behavior**
   
   I want to know whether there is a config that reads only the target partition, skips the others, and also does not read other partitions' data files to infer the schema.
   
   **Environment Description**
   
   * Hudi version : 0.14.0
   
   * Spark version : 3.4.1
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : GCS
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   I use the configurations below to write to the table:
   ```python
   hudi_write_options = {
       'hoodie.table.name': hudi_table_name,
       'hoodie.datasource.write.partitionpath.field': 'operation_type, ingestion_dt',
       'hoodie.datasource.write.operation': 'insert',
       'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
       'hoodie.parquet.compression.codec': 'zstd',
       'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
       'hoodie.datasource.write.reconcile.schema': True,
       'hoodie.metadata.enable': True
   }
   ```
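   
   A minimal sketch of how these options are applied when writing, assuming a PySpark DataFrame `df`; the GCS base path, record key field, and precombine field below are illustrative placeholders, not my exact values:
   ```python
   # Illustrative write call; `df`, the key/precombine fields, and the GCS path are placeholders.
   (df.write
       .format("hudi")
       .options(**hudi_write_options)
       .option("hoodie.datasource.write.recordkey.field", "record_id")      # placeholder key field
       .option("hoodie.datasource.write.precombine.field", "ingestion_dt")  # placeholder precombine field
       .mode("append")
       .save("gs://my-bucket/hudi/" + hudi_table_name))
   ```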
   
   And I use the configurations below to read from the table:
   ```python
   read_streaming_hudi_options = {
       'maxFilesPerTrigger': 5,
       'hoodie.datasource.read.incr.path.glob': 'update/202*',
       'hoodie.read.timeline.holes.resolution.policy': 'BLOCK',
       'hoodie.datasource.read.file.index.listing.partition-path-prefix.analysis.enabled': False,
       'hoodie.file.listing.parallelism': 1000,
       'hoodie.metadata.enable': True,
       'hoodie.datasource.read.schema.use.end.instanttime': True,
       'hoodie.datasource.streaming.startOffset': '20231211000000000'
   }
   ```
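   
   For context, the streaming read itself looks roughly like this (the GCS table path and checkpoint location are placeholders, not my exact values):
   ```python
   # Illustrative streaming read; the paths are placeholders.
   stream_df = (spark.readStream
       .format("hudi")
       .options(**read_streaming_hudi_options)
       .load("gs://my-bucket/hudi/" + hudi_table_name))
   
   query = (stream_df.writeStream
       .format("console")
       .option("checkpointLocation", "gs://my-bucket/checkpoints/" + hudi_table_name)
       .start())
   ```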
   
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   

