[ https://issues.apache.org/jira/browse/HUDI-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-5442:
----------------------------------
    Story Points: 2  (was: 5)

> Fix HiveHoodieTableFileIndex to use lazy listing
> ------------------------------------------------
>
>                 Key: HUDI-5442
>                 URL: https://issues.apache.org/jira/browse/HUDI-5442
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: reader-core, trino-presto
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Blocker
>             Fix For: 0.13.0
>
>
> Currently, HiveHoodieTableFileIndex hard-codes the shouldListLazily to false,
> using eager listing only. This leads to scanning all table partitions in the
> file index, regardless of the queryPaths provided (for Trino Hive connector,
> only one partition is passed in).
> {code:java}
> public HiveHoodieTableFileIndex(HoodieEngineContext engineContext,
>                                 HoodieTableMetaClient metaClient,
>                                 TypedProperties configProperties,
>                                 HoodieTableQueryType queryType,
>                                 List<Path> queryPaths,
>                                 Option<String> specifiedQueryInstant,
>                                 boolean shouldIncludePendingCommits
> ) {
>   super(engineContext,
>       metaClient,
>       configProperties,
>       queryType,
>       queryPaths,
>       specifiedQueryInstant,
>       shouldIncludePendingCommits,
>       true,
>       new NoopCache(),
>       false);
> } {code}
> After flipping it to true for testing, the following exception is thrown.
> {code:java}
> io.trino.spi.TrinoException: Failed to parse partition column values from the partition-path: likely non-encoded slashes being used in partition column's values. You can try to work this around by switching listing mode to eager
>     at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:284)
>     at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
>     at io.trino.$gen.Trino_392____20221217_092723_2.run(Unknown Source)
>     at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>     at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: org.apache.hudi.exception.HoodieException: Failed to parse partition column values from the partition-path: likely non-encoded slashes being used in partition column's values. You can try to work this around by switching listing mode to eager
>     at org.apache.hudi.BaseHoodieTableFileIndex.parsePartitionColumnValues(BaseHoodieTableFileIndex.java:317)
>     at org.apache.hudi.BaseHoodieTableFileIndex.lambda$listPartitionPaths$6(BaseHoodieTableFileIndex.java:288)
>     at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
>     at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
>     at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
>     at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
>     at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
>     at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
>     at org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPaths(BaseHoodieTableFileIndex.java:291)
>     at org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:205)
>     at org.apache.hudi.BaseHoodieTableFileIndex.getAllInputFileSlices(BaseHoodieTableFileIndex.java:216)
>     at org.apache.hudi.hadoop.HiveHoodieTableFileIndex.listFileSlices(HiveHoodieTableFileIndex.java:71)
>     at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:263)
>     at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:158)
>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
>     at org.apache.hudi.hadoop.HoodieParquetInputFormatBase.getSplits(HoodieParquetInputFormatBase.java:68)
>     at io.trino.plugin.hive.BackgroundHiveSplitLoader.lambda$loadPartition$2(BackgroundHiveSplitLoader.java:493)
>     at io.trino.plugin.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:25)
>     at io.trino.plugin.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:97)
>     at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:493)
>     at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:353)
>     at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:277)
>     ... 6 more {code}
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)