sassai opened a new issue #2145:
URL: https://github.com/apache/hudi/issues/2145


   **Describe the problem you faced**
   
   Running a query in Hive on Hudi data with a LIMIT clause results in an IOException:
   
   ```console
   java.io.IOException: Input path does not exist: abfs://x...@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=5/.hoodie_partition_metadata
   ```
   
   The `.hoodie_partition_metadata` file does exist and can be listed with `hdfs dfs -ls` using the path above.
   
   Example query used:
   
   `select * from nyc_taxi.address limit 100;`
   
   Running the same query without the LIMIT clause works fine.
   
   The `HIVE_AUX_JAR` variable holds `hudi-utilities-bundle_2.11-0.6.0.jar` and `hudi-hadoop-mr-bundle-0.6.0.jar`.
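   For reference, a per-session alternative to the environment variable is Hive's `ADD JAR` command; the sketch below is hypothetical and the `/opt/hudi/` paths are assumed, not taken from this environment:
   
   ```sql
   -- Hypothetical sketch: register the same Hudi bundles for the current Hive session.
   -- The /opt/hudi/ paths are assumed, not taken from this setup.
   ADD JAR /opt/hudi/hudi-hadoop-mr-bundle-0.6.0.jar;
   ADD JAR /opt/hudi/hudi-utilities-bundle_2.11-0.6.0.jar;
   
   -- Confirm the jars are visible to the session.
   LIST JARS;
   ```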
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a COPY_ON_WRITE table
   2. Insert records into the table (the table has 11 million records)
   3. `set hive.fetch.task.conversion=none;`
   4. Query the table using the statement above (see the session sketch below)
   5. An IOException is thrown
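   
   A minimal Hive session sketch of the failing sequence, using the table name from this report (the DDL and the load that produced the 11 million rows are omitted):
   
   ```sql
   -- Step 3: disable simple fetch-task conversion so the query runs as a Tez job.
   set hive.fetch.task.conversion=none;
   
   -- Works: the same query without LIMIT.
   select * from nyc_taxi.address;
   
   -- Fails: split generation throws "Input path does not exist: .../.hoodie_partition_metadata".
   select * from nyc_taxi.address limit 100;
   ```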
   
   **Expected behavior**
   
   A result set containing 100 records is returned.
   
   **Environment Description**
   
   * Hudi version : 0.6.0
   
   * Spark version : 2.4.0
   
   * Hive version : 3.1
   
   * Hadoop version : 3
   
   * Storage (HDFS/S3/GCS..) : ADLS Gen2
   
   * Running on Docker? (yes/no) : no
   
   **Stacktrace**
   
   ```console
   Error while compiling statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
   Vertex failed, vertexName=Map 1, vertexId=vertex_1601881880788_0031_6_00,
   diagnostics=[Vertex vertex_1601881880788_0031_6_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: address initializer failed, vertex=vertex_1601881880788_0031_6_00 [Map 1],
   org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: abfs://x...@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=5/.hoodie_partition_metadata
       at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:300)
       at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:240)
       at org.apache.hudi.hadoop.HoodieParquetInputFormat.listStatus(HoodieParquetInputFormat.java:105)
       at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:328)
       at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:541)
       at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:830)
       at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:249)
       at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:280)
       at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:271)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
       at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:271)
       at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:255)
       at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
       at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
       at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
   Caused by: java.io.IOException: Input path does not exist: abfs://x...@xxx.dfs.core.windows.net/data/hudi/batch/tables/nyc_taxi/address/year=2020/month=10/day=5/.hoodie_partition_metadata
       at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:274)
       ... 19 more]
   DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
   ```
   
   

