kkrugler opened a new issue, #8147:
URL: https://github.com/apache/hudi/issues/8147

   **Describe the problem you faced**
   
   When using Flink to do an incremental query read from a table, with Hudi 0.12.2 and Flink 1.15, I occasionally get a `ClassNotFoundException` for `org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat`. This usually happens when running the test from inside Eclipse, and occasionally from the command line.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. `git clone https://github.com/kkrugler/flink-hudi-query-test`
   2. `cd flink-hudi-query-test`
   3. `git checkout flink-1.15-hudi-0.12`
   4. `mvn clean package`
   
   **Expected behavior**
   
   The tests should all pass.
   
   **Environment Description**
   
   * Hudi version : 0.12.2
   
   * Flink version : 1.15.1
   
   **Additional context**
   
   I believe the problem is that the `hudi-hadoop-mr` dependency on `hive-exec` (with classifier `core`) is marked as provided, but when running a Flink workflow on a typical Flink cluster you don't have the Hive jars installed. It's probably fine for `hudi-hadoop-mr` to mark this dependency as provided, but `hudi-flink` should then declare an explicit dependency on the artifact, something like:
   
   ``` xml
           <dependency>
               <groupId>org.apache.hive</groupId>
               <artifactId>hive-exec</artifactId>
               <classifier>core</classifier>
               <version>${hive.version}</version>
               <exclusions>
                   <exclusion>
                       <groupId>*</groupId>
                       <artifactId>*</artifactId>
                   </exclusion>
               </exclusions>
           </dependency>
   ```
   
   Note the exclusion of all transitive dependencies. All that Hudi needs from `hive-exec` is the one missing class, since Hudi uses `HoodieParquetInputFormatBase` as the base class, as per:
   
   ``` java
   /**
    * !!! PLEASE READ CAREFULLY !!!
    *
    * NOTE: Hive bears optimizations which are based upon validating whether {@link FileInputFormat}
    * implementation inherits from {@link MapredParquetInputFormat}.
    *
    * To make sure that Hudi implementations are leveraging these optimizations to the fullest, this class
    * serves as a base-class for every {@link FileInputFormat} implementations working with Parquet file-format.
    *
    * However, this class serves as a simple delegate to the actual implementation hierarchy: it expects
    * either {@link HoodieCopyOnWriteTableInputFormat} or {@link HoodieMergeOnReadTableInputFormat} to be supplied
    * to which it delegates all of its necessary methods.
    */
   public abstract class HoodieParquetInputFormatBase extends MapredParquetInputFormat implements Configurable {
   ```
   
   And if you don't do this exclusion, you wind up pulling in a lot of additional code that isn't needed (AFAICT).
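   As a quick way to confirm whether the missing class is visible on a given classpath (e.g. when comparing the Eclipse run to the command-line run), a small probe like the following can help. `ClasspathCheck` is just a hypothetical helper name for illustration, not part of Hudi:

   ``` java
   // Minimal classpath probe: reports whether a class can be resolved
   // by the current classloader, without initializing it.
   public class ClasspathCheck {

       public static boolean isPresent(String className) {
           try {
               // initialize=false avoids running static initializers
               Class.forName(className, false, ClasspathCheck.class.getClassLoader());
               return true;
           } catch (ClassNotFoundException e) {
               return false;
           }
       }

       public static void main(String[] args) {
           String cls = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat";
           System.out.println(cls + " present: " + isPresent(cls));
       }
   }
   ```

   If this prints `false` in the environment where the Flink job runs, the `hive-exec` (core) jar isn't on the classpath, which is consistent with the `NoClassDefFoundError` in the stacktrace.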
   
   **Stacktrace**
   
   ```
   23/03/09 10:08:22 INFO executiongraph.ExecutionGraph:1423 - Source: split_monitor(table=[example-table], fields=[event_time, data, enrichment, key, partition]) (1/1) (16f707e9f9462ca1ac57f69e5bc9ae4e) switched from RUNNING to FAILED on 9cbbe102-0f19-48d6-849c-4755cab4fa2d @ localhost (dataPort=-1).
   java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat
        at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
        at java.lang.ClassLoader.defineClass(ClassLoader.java:1016) ~[?:?]
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
        at jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:800) ~[?:?]
        at jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:698) ~[?:?]
        at jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:621) ~[?:?]
        at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579) ~[?:?]
        at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) ~[?:?]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:521) ~[?:?]
        at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
        at java.lang.ClassLoader.defineClass(ClassLoader.java:1016) ~[?:?]
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
        at jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:800) ~[?:?]
        at jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:698) ~[?:?]
        at jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:621) ~[?:?]
        at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579) ~[?:?]
        at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) ~[?:?]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:521) ~[?:?]
        at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
        at java.lang.ClassLoader.defineClass(ClassLoader.java:1016) ~[?:?]
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
        at jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:800) ~[?:?]
        at jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:698) ~[?:?]
        at jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:621) ~[?:?]
        at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579) ~[?:?]
        at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) ~[?:?]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:521) ~[?:?]
        at org.apache.hudi.sink.partitioner.profile.WriteProfiles.getCommitMetadata(WriteProfiles.java:236) ~[hudi-flink-0.12.2.jar:0.12.2]
        at org.apache.hudi.source.IncrementalInputSplits.lambda$inputSplits$2(IncrementalInputSplits.java:285) ~[hudi-flink-0.12.2.jar:0.12.2]
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) ~[?:?]
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655) ~[?:?]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) ~[?:?]
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) ~[?:?]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) ~[?:?]
        at org.apache.hudi.source.IncrementalInputSplits.inputSplits(IncrementalInputSplits.java:285) ~[hudi-flink-0.12.2.jar:0.12.2]
        at org.apache.hudi.source.StreamReadMonitoringFunction.monitorDirAndForwardSplits(StreamReadMonitoringFunction.java:199) ~[hudi-flink-0.12.2.jar:0.12.2]
        at org.apache.hudi.source.StreamReadMonitoringFunction.run(StreamReadMonitoringFunction.java:172) ~[hudi-flink-0.12.2.jar:0.12.2]
        at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110) ~[flink-streaming-java-1.15.1.jar:1.15.1]
        at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67) ~[flink-streaming-java-1.15.1.jar:1.15.1]
        at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:332) ~[flink-streaming-java-1.15.1.jar:1.15.1]
   Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
        at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581) ~[?:?]
        at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) ~[?:?]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:521) ~[?:?]
        ... 42 more
   ```
   
   

