creactiviti opened a new issue #1670:
URL: https://github.com/apache/hudi/issues/1670


   I'm attempting to run the CDC example scenario (http://hudi.apache.org/blog/change-capture-using-aws/) on Amazon EMR 5.30.0 and am hitting an error when querying the table through Presto.
   
   1. Have DMS generate the raw `.parquet` files in S3.
   2. Use `HoodieDeltaStreamer` to process the raw `.parquet` files:
   
   ```
   spark-submit --jars /usr/lib/spark/external/lib/spark-avro.jar \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     --master yarn \
     --deploy-mode client \
     /usr/lib/hudi/hudi-utilities-bundle.jar \
     --table-type COPY_ON_WRITE \
     --source-ordering-field updated_at \
     --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
     --target-base-path s3://my-test-bucket/hudi_orders \
     --target-table hudi_orders \
     --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
     --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
     --enable-hive-sync \
     --hoodie-conf hoodie.datasource.write.recordkey.field=order_id \
     --hoodie-conf hoodie.datasource.write.partitionpath.field=customer_name \
     --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://my-test-bucket/hudi_dms/orders \
     --hoodie-conf hoodie.datasource.hive_sync.table=orders \
     --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
     --hoodie-conf hoodie.datasource.hive_sync.partition_fields=customer_name
   ```
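
   To make the flags above concrete: `--payload-class AWSDmsAvroPayload` applies DMS change records during upsert, and `--source-ordering-field updated_at` decides which version of a record wins. Below is a rough pure-Python model of those semantics (this is an illustration, not Hudi's actual implementation): rows whose DMS `Op` column is `D` delete the existing record, everything else inserts or overwrites, and stale changes (older `updated_at`) are ignored.

   ```python
   # Rough model (NOT Hudi's actual code) of AWSDmsAvroPayload upsert
   # semantics, keyed by the record key (order_id in this issue).
   def apply_cdc(table, changes, ordering_field="updated_at"):
       """table/changes: dicts keyed by record key."""
       for key, rec in changes.items():
           current = table.get(key)
           # Honor the source-ordering field: skip stale change records.
           if current and current[ordering_field] > rec[ordering_field]:
               continue
           if rec.get("Op") == "D":
               table.pop(key, None)   # DMS delete marker
           else:
               table[key] = rec       # insert ("I") or update ("U")
       return table

   orders = {1: {"updated_at": 1, "qty": 5}}
   changes = {
       1: {"Op": "U", "updated_at": 2, "qty": 7},
       2: {"Op": "I", "updated_at": 2, "qty": 1},
   }
   print(apply_cdc(orders, changes))
   # {1: {'Op': 'U', 'updated_at': 2, 'qty': 7}, 2: {'Op': 'I', 'updated_at': 2, 'qty': 1}}
   ```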
   
   * Hudi version : 0.5.2 (incubating)
   
   * Spark version : 2.4.5
   
   * Hive version : 2.3.6
   
   * Presto version: 0.232
   
   * Hadoop version : Amazon 2.8.5
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   **Querying using Hive**
   
   Running on Hive:
   
   ```
   hive> select count(*) from orders;
   Query ID = root_20200526144157_e4b7cb38-be47-44e0-8317-8aa87c419995
   Total jobs = 1
   Launching Job 1 out of 1
   Status: Running (Executing on YARN cluster with App id application_1590502613834_0007)

   ----------------------------------------------------------------------------------------------
           VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
   ----------------------------------------------------------------------------------------------
   Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
   Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0
   ----------------------------------------------------------------------------------------------
   VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 10.32 s
   ----------------------------------------------------------------------------------------------
   OK
   7
   Time taken: 12.039 seconds, Fetched: 1 row(s)
   ```
   
   **Querying using Presto**
   
   ```
   presto:default> select count(*) from orders;
   
   Query 20200526_144243_00006_f8j6h, FAILED, 2 nodes
   Splits: 24 total, 0 done (0.00%)
   0:01 [0 rows, 0B] [0 rows/s, 0B/s]
   
   Query 20200526_144243_00006_f8j6h failed: Error opening Hive split s3://my-test-bucket/hudi_orders/nathan/b8fd6f7b-0bf5-458b-8cbb-f11e0ede995e-0_1-23-12020_20200526143655.parquet (offset=0, length=435285): Unknown converted type TIMESTAMP_MICROS
   ```
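
   For anyone reproducing this: the failing converted type can be confirmed locally without Presto. The sketch below (assuming `pyarrow` is installed; the file path is hypothetical) writes a small Parquet file with microsecond-precision timestamps and prints the column type, which surfaces as the `TIMESTAMP_MICROS` converted type that Presto 0.232's Parquet reader rejects. Running `pq.read_schema(...)` against a copy of one of the Hudi `.parquet` files from S3 would show the same thing.

   ```python
   # Diagnostic sketch: show the Parquet timestamp type that trips
   # Presto 0.232. Assumes pyarrow is available; /tmp/ts_check.parquet
   # is just a scratch path for illustration.
   import datetime
   import pyarrow as pa
   import pyarrow.parquet as pq

   table = pa.table({
       "order_id": [1, 2],
       "updated_at": [datetime.datetime(2020, 5, 26, 14, 36),
                      datetime.datetime(2020, 5, 26, 14, 37)],
   })
   pq.write_table(table, "/tmp/ts_check.parquet")

   # Python datetimes are stored with microsecond precision, i.e. the
   # TIMESTAMP_MICROS converted type from the Presto error.
   schema = pq.read_schema("/tmp/ts_check.parquet")
   print(schema.field("updated_at").type)  # timestamp[us]
   ```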
   
   

