[ https://issues.apache.org/jira/browse/HUDI-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Istvan Darvas updated HUDI-4046:
--------------------------------
Description:
Hi Guys!

I would like to control the number of partitions which will be read by HUDI.

base_path: str
partition_paths: List[str] = ["prefix/part1", "prefix/part2", "prefix/part3"]

table_df = (spark.read
    .format("org.apache.hudi")
    .option("basePath", base_path)
    .option("hoodie.datasource.read.paths", ",".join(partition_paths))  # comma-separated list
    .load(partition_paths))

This works if I explicitly set "hoodie.datasource.read.paths"; I actually have to generate a comma-separated list for that parameter. If I do not set it, I get a HUDI exception telling me I need to set it.

It would be great if HUDI would use the partition_paths from the Spark read API (List[str]).

One more thing: I do not get an exception if I do not set "hoodie.datasource.read.paths" and use load(base_path), but in that case the Spark HUDI read reads the whole table, which can be very time-consuming for a very big table with lots of partitions.

Darvi

Connected Slack thread: [https://apache-hudi.slack.com/archives/C4D716NPQ/p1651667472584579]

was:
Hi Guys!

I would like to control the number of partitions which will be read by HUDI.

base_path: str
partition_paths: List[str] = ["prefix/part1", "prefix/part2", "prefix/part3"]

ingress_pkg_arrived = (spark.read
    .format("org.apache.hudi")
    .option("basePath", base_path)
    .option("hoodie.datasource.read.paths", ",".join(partition_paths))  # comma-separated list
    .load(partition_paths))

This works if I explicitly set "hoodie.datasource.read.paths"; I actually have to generate a comma-separated list for that parameter. If I do not set it, I get a HUDI exception telling me I need to set it.

It would be great if HUDI would use the partition_paths from the Spark read API (List[str]).

One more thing: I do not get an exception if I do not set "hoodie.datasource.read.paths" and use load(base_path), but in that case the Spark HUDI read reads the whole table, which can be very time-consuming for a very big table with lots of partitions.

Darvi

Connected Slack thread: https://apache-hudi.slack.com/archives/C4D716NPQ/p1651667472584579

> spark.read.load API
> -------------------
>
>                 Key: HUDI-4046
>                 URL: https://issues.apache.org/jira/browse/HUDI-4046
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.10.1
>            Reporter: Istvan Darvas
>            Priority: Minor
>
> Hi Guys!
>
> I would like to control the number of partitions which will be read by HUDI.
>
> base_path: str
> partition_paths: List[str] = ["prefix/part1", "prefix/part2", "prefix/part3"]
>
> table_df = (spark.read
>     .format("org.apache.hudi")
>     .option("basePath", base_path)
>     .option("hoodie.datasource.read.paths", ",".join(partition_paths))  # comma-separated list
>     .load(partition_paths))
>
> This works if I explicitly set "hoodie.datasource.read.paths"; I actually
> have to generate a comma-separated list for that parameter.
> If I do not set it, I get a HUDI exception telling me I need to set it.
>
> It would be great if HUDI would use the partition_paths from the Spark read
> API (List[str]).
>
> One more thing: I do not get an exception if I do not set
> "hoodie.datasource.read.paths" and use load(base_path), but in that case
> the Spark HUDI read reads the whole table, which can be very time-consuming
> for a very big table with lots of partitions.
>
> Darvi
>
> Connected Slack thread:
> [https://apache-hudi.slack.com/archives/C4D716NPQ/p1651667472584579]
>

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
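The manual workaround described in the report — flattening the List[str] into the comma-separated string that "hoodie.datasource.read.paths" expects — can be sketched in plain Python. The base_path and partition values below are placeholders mirroring the report, not a real table layout, and the Spark read itself is shown only as a comment since it needs a live SparkSession and a Hudi table:

```python
from typing import List

# Placeholder values mirroring the report; the real base_path and
# partition layout depend on your table.
base_path: str = "s3://bucket/table"
partition_paths: List[str] = ["prefix/part1", "prefix/part2", "prefix/part3"]

# "hoodie.datasource.read.paths" takes a single comma-separated string,
# so the List[str] has to be flattened by hand before the read:
read_paths: str = ",".join(partition_paths)
# read_paths == "prefix/part1,prefix/part2,prefix/part3"

# The read then follows the report (not executed here):
# table_df = (spark.read
#     .format("org.apache.hudi")
#     .option("basePath", base_path)
#     .option("hoodie.datasource.read.paths", read_paths)
#     .load(partition_paths))
```

The feature request is precisely that this flattening step be unnecessary: load(partition_paths) already carries the same information as a list.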