kkrugler opened a new issue, #8147: URL: https://github.com/apache/hudi/issues/8147
**Describe the problem you faced**

When using Flink to do an incremental query read from a Hudi table, with Hudi 0.12.2 and Flink 1.15, I occasionally get a ClassNotFoundException for `org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat`. This usually happens when running the test from inside Eclipse, and occasionally from the command line.

**To Reproduce**

Steps to reproduce the behavior:

1. `git clone https://github.com/kkrugler/flink-hudi-query-test`
2. `cd flink-hudi-query-test`
3. `git checkout flink-1.15-hudi-0.12`
4. `mvn clean package`

**Expected behavior**

The tests should all pass.

**Environment Description**

* Hudi version : 0.12.2
* Flink version : 1.15.1

**Additional context**

I believe the problem is that the `hudi-hadoop-mr` dependency on `hive-exec` (with classifier `core`) is marked as provided, but when running a Flink workflow on a typical Flink cluster you don't have Hive jars installed. It may be OK for `hudi-hadoop-mr` to mark this as provided, but `hudi-flink` should then declare an explicit dependency on the artifact, something like:

```xml
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <classifier>core</classifier>
  <version>${hive.version}</version>
  <exclusions>
    <exclusion>
      <groupId>*</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Note the exclusion of all transitive dependencies. All that Hudi needs from `hive-exec` is the one missing class, since `HoodieParquetInputFormatBase` extends it:

```java
/**
 * !!! PLEASE READ CAREFULLY !!!
 *
 * NOTE: Hive bears optimizations which are based upon validating whether {@link FileInputFormat}
 * implementation inherits from {@link MapredParquetInputFormat}.
 *
 * To make sure that Hudi implementations are leveraging these optimizations to the fullest, this class
 * serves as a base-class for every {@link FileInputFormat} implementations working with Parquet file-format.
 *
 * However, this class serves as a simple delegate to the actual implementation hierarchy: it expects
 * either {@link HoodieCopyOnWriteTableInputFormat} or {@link HoodieMergeOnReadTableInputFormat} to be supplied
 * to which it delegates all of its necessary methods.
 */
public abstract class HoodieParquetInputFormatBase extends MapredParquetInputFormat implements Configurable {
```

Without this exclusion, you wind up pulling in lots of additional code that isn't needed (AFAICT).
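As a side note, the failure shows up as soon as the JVM tries to define any class in this hierarchy, not only when Hive-specific functionality is exercised, because defining a class forces resolution of its superclass chain. A minimal sketch (not Hudi code; this assumes the class lives at `org.apache.hudi.hadoop.HoodieParquetInputFormatBase`, everything else is illustrative):

```java
public class LoadCheck {
    public static void main(String[] args) {
        try {
            // Resolving this Hudi class forces the JVM to also resolve its
            // superclass, MapredParquetInputFormat, which lives in hive-exec.
            Class.forName("org.apache.hudi.hadoop.HoodieParquetInputFormatBase");
        } catch (Throwable t) {
            // Without hive-exec on the classpath this prints:
            // java.lang.NoClassDefFoundError:
            //     org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat
            t.printStackTrace();
        }
    }
}
```

This is consistent with the stacktrace below, where the error surfaces during class definition triggered from `WriteProfiles.getCommitMetadata`.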
**Stacktrace**

```
23/03/09 10:08:22 INFO executiongraph.ExecutionGraph:1423 - Source: split_monitor(table=[example-table], fields=[event_time, data, enrichment, key, partition]) (1/1) (16f707e9f9462ca1ac57f69e5bc9ae4e) switched from RUNNING to FAILED on 9cbbe102-0f19-48d6-849c-4755cab4fa2d @ localhost (dataPort=-1).
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat
    at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
    at java.lang.ClassLoader.defineClass(ClassLoader.java:1016) ~[?:?]
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
    at jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:800) ~[?:?]
    at jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:698) ~[?:?]
    at jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:621) ~[?:?]
    at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579) ~[?:?]
    at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) ~[?:?]
    at java.lang.ClassLoader.loadClass(ClassLoader.java:521) ~[?:?]
    at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
    at java.lang.ClassLoader.defineClass(ClassLoader.java:1016) ~[?:?]
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
    at jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:800) ~[?:?]
    at jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:698) ~[?:?]
    at jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:621) ~[?:?]
    at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579) ~[?:?]
    at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) ~[?:?]
    at java.lang.ClassLoader.loadClass(ClassLoader.java:521) ~[?:?]
    at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
    at java.lang.ClassLoader.defineClass(ClassLoader.java:1016) ~[?:?]
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
    at jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:800) ~[?:?]
    at jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:698) ~[?:?]
    at jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:621) ~[?:?]
    at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579) ~[?:?]
    at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) ~[?:?]
    at java.lang.ClassLoader.loadClass(ClassLoader.java:521) ~[?:?]
    at org.apache.hudi.sink.partitioner.profile.WriteProfiles.getCommitMetadata(WriteProfiles.java:236) ~[hudi-flink-0.12.2.jar:0.12.2]
    at org.apache.hudi.source.IncrementalInputSplits.lambda$inputSplits$2(IncrementalInputSplits.java:285) ~[hudi-flink-0.12.2.jar:0.12.2]
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) ~[?:?]
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655) ~[?:?]
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) ~[?:?]
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) ~[?:?]
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) ~[?:?]
    at org.apache.hudi.source.IncrementalInputSplits.inputSplits(IncrementalInputSplits.java:285) ~[hudi-flink-0.12.2.jar:0.12.2]
    at org.apache.hudi.source.StreamReadMonitoringFunction.monitorDirAndForwardSplits(StreamReadMonitoringFunction.java:199) ~[hudi-flink-0.12.2.jar:0.12.2]
    at org.apache.hudi.source.StreamReadMonitoringFunction.run(StreamReadMonitoringFunction.java:172) ~[hudi-flink-0.12.2.jar:0.12.2]
    at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110) ~[flink-streaming-java-1.15.1.jar:1.15.1]
    at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67) ~[flink-streaming-java-1.15.1.jar:1.15.1]
    at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:332) ~[flink-streaming-java-1.15.1.jar:1.15.1]
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
    at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581) ~[?:?]
    at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) ~[?:?]
    at java.lang.ClassLoader.loadClass(ClassLoader.java:521) ~[?:?]
    ... 42 more
```
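For reference, here is a minimal sketch of the kind of streaming/incremental read that hits this code path. This is not the code from the linked repo; the table name, path, and schema are placeholders (loosely based on the field names visible in the log line above):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HudiStreamReadSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical Hudi table; path and schema are illustrative only.
        tEnv.executeSql(
            "CREATE TABLE example_table ("
                + "  event_time TIMESTAMP(3),"
                + "  data STRING,"
                + "  enrichment STRING,"
                + "  key STRING,"
                + "  `partition` STRING"
                + ") PARTITIONED BY (`partition`) WITH ("
                + "  'connector' = 'hudi',"
                + "  'path' = 'file:///tmp/example-table',"
                + "  'read.streaming.enabled' = 'true'" // incremental/streaming read
                + ")");

        // The split monitor created for this query is where the
        // NoClassDefFoundError surfaces when hive-exec isn't on the classpath.
        tEnv.executeSql("SELECT * FROM example_table").print();
    }
}
```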