[ https://issues.apache.org/jira/browse/HUDI-7945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sagar Sumit closed HUDI-7945.
-----------------------------
    Resolution: Fixed

> Fix partition pruning using PARTITION_STATS index in Spark
> ----------------------------------------------------------
>
>                 Key: HUDI-7945
>                 URL: https://issues.apache.org/jira/browse/HUDI-7945
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0-beta2, 1.0.0
>
> The issue can be reproduced by
> https://github.com/apache/hudi/pull/11472#issuecomment-2199332859.
> When there is more than one base file in a table partition, the
> corresponding PARTITION_STATS index record in the metadata table contains
> null in the file_path field of HoodieColumnRangeMetadata, because
> mergeRanges drops the file path when merging column ranges:
> {code:java}
> private static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> mergeRanges(HoodieColumnRangeMetadata<T> one,
>                                                                                    HoodieColumnRangeMetadata<T> another) {
>   ValidationUtils.checkArgument(one.getColumnName().equals(another.getColumnName()),
>       "Column names should be the same for merging column ranges");
>   final T minValue = getMinValueForColumnRanges(one, another);
>   final T maxValue = getMaxValueForColumnRanges(one, another);
>   return HoodieColumnRangeMetadata.create(
>       // the file path is set to null when ranges from multiple base files are merged
>       null, one.getColumnName(), minValue, maxValue,
>       one.getNullCount() + another.getNullCount(),
>       one.getValueCount() + another.getValueCount(),
>       one.getTotalSize() + another.getTotalSize(),
>       one.getTotalUncompressedSize() + another.getTotalUncompressedSize());
> }
> {code}
> The null file path causes an NPE when loading the column stats per partition
> from the PARTITION_STATS index. Also, the current implementation of
> PartitionStatsIndexSupport assumes that the file_path field contains the
> exact file name, so it does not work when the file path is null (even
> storing a list of file names does not work). We have to reimplement
> PartitionStatsIndexSupport so that it returns the pruned partitions for
> further processing.
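>
> For illustration only (this sketch is not part of the Hudi codebase), the
> failure mode is the standard java.util.stream behavior that a null grouping
> key is rejected, which is what happens once a merged range record with a
> null file path reaches HoodieListPairData.groupByKey. The Range record below
> is a hypothetical stand-in for HoodieColumnRangeMetadata keyed by file path:
> {code:java}
> import java.util.Arrays;
> import java.util.List;
> import java.util.Map;
> import java.util.stream.Collectors;
>
> public class NullKeyGroupingDemo {
>   // Hypothetical stand-in for a column range record keyed by its file path.
>   record Range(String filePath, String columnName) {}
>
>   public static void main(String[] args) {
>     List<Range> ranges = Arrays.asList(
>         new Range("file-1.parquet", "colA"),
>         new Range(null, "colA")); // merged record: file path was set to null
>
>     // Throws java.lang.NullPointerException:
>     //   "element cannot be mapped to a null key"
>     // matching the stack trace below.
>     Map<String, List<Range>> byFile =
>         ranges.stream().collect(Collectors.groupingBy(Range::filePath));
>     System.out.println(byFile);
>   }
> }
> {code}
> Consistent with the reimplementation noted above, grouping by partition path
> rather than file path would avoid the null key entirely.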
> {code:java}
> Caused by: java.lang.NullPointerException: element cannot be mapped to a null key
>   at java.util.Objects.requireNonNull(Objects.java:228)
>   at java.util.stream.Collectors.lambda$groupingBy$45(Collectors.java:907)
>   at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
>   at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>   at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>   at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>   at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>   at java.util.Iterator.forEachRemaining(Iterator.java:116)
>   at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
>   at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
>   at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
>   at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>   at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>   at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
>   at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
>   at java.util.stream.AbstractTask.compute(AbstractTask.java:327)
>   at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
>   at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
>   at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
>   at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
>   at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
>   at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>   at org.apache.hudi.common.data.HoodieListPairData.groupByKey(HoodieListPairData.java:115)
>   at org.apache.hudi.ColumnStatsIndexSupport.transpose(ColumnStatsIndexSupport.scala:253)
>   at org.apache.hudi.ColumnStatsIndexSupport.$anonfun$loadTransposed$1(ColumnStatsIndexSupport.scala:149)
>   at org.apache.hudi.HoodieCatalystUtils$.withPersistedData(HoodieCatalystUtils.scala:61)
>   at org.apache.hudi.ColumnStatsIndexSupport.loadTransposed(ColumnStatsIndexSupport.scala:148)
>   at org.apache.hudi.ColumnStatsIndexSupport.computeCandidateFileNames(ColumnStatsIndexSupport.scala:101)
>   at org.apache.hudi.HoodieFileIndex.$anonfun$lookupCandidateFilesInMetadataTable$3(HoodieFileIndex.scala:354)
>   at org.apache.hudi.HoodieFileIndex.$anonfun$lookupCandidateFilesInMetadataTable$3$adapted(HoodieFileIndex.scala:351)
>   at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:985)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984)
>   at org.apache.hudi.HoodieFileIndex.$anonfun$lookupCandidateFilesInMetadataTable$1(HoodieFileIndex.scala:351)
>   at scala.util.Try$.apply(Try.scala:213)
>   at org.apache.hudi.HoodieFileIndex.lookupCandidateFilesInMetadataTable(HoodieFileIndex.scala:338)
>   at org.apache.hudi.HoodieFileIndex.filterFileSlices(HoodieFileIndex.scala:241)
>   ... 106 more
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)