creactiviti opened a new issue #1670: URL: https://github.com/apache/hudi/issues/1670
I'm attempting to execute the CDC example scenario (http://hudi.apache.org/blog/change-capture-using-aws/) on Amazon EMR (5.30.0) and running into an issue when querying the table using Presto.

1. Have DMS generate the raw `.parquet` files in S3.
2. Use `HoodieDeltaStreamer` to process the raw `.parquet` files:

```
spark-submit --jars /usr/lib/spark/external/lib/spark-avro.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --master yarn \
  --deploy-mode client /usr/lib/hudi/hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field updated_at \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3://my-test-bucket/hudi_orders \
  --target-table hudi_orders \
  --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
  --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
  --enable-hive-sync \
  --hoodie-conf hoodie.datasource.write.recordkey.field=order_id,hoodie.datasource.write.partitionpath.field=customer_name,hoodie.deltastreamer.source.dfs.root=s3://my-test-bucket/hudi_dms/orders,hoodie.datasource.hive_sync.table=orders,hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor,hoodie.datasource.hive_sync.partition_fields=customer_name
```

**Environment**

* Hudi version : 0.5.2 (incubating)
* Spark version : 2.4.5
* Hive version : 2.3.6
* Presto version : 0.232
* Hadoop version : Amazon 2.8.5
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No

**Querying using Hive**

```
hive> select count(*) from orders;
Query ID = root_20200526144157_e4b7cb38-be47-44e0-8317-8aa87c419995
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1590502613834_0007)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........      container     SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......      container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 10.32 s
----------------------------------------------------------------------------------------------
OK
7
Time taken: 12.039 seconds, Fetched: 1 row(s)
```

**Querying using Presto**

```
presto:default> select count(*) from orders;
Query 20200526_144243_00006_f8j6h, FAILED, 2 nodes
Splits: 24 total, 0 done (0.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20200526_144243_00006_f8j6h failed: Error opening Hive split s3://my-test-bucket/hudi_orders/nathan/b8fd6f7b-0bf5-458b-8cbb-f11e0ede995e-0_1-23-12020_20200526143655.parquet (offset=0, length=435285): Unknown converted type TIMESTAMP_MICROS
```

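It may help to note that `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` (which Presto 0.232 does understand) are both physically an int64 offset from the Unix epoch; only the declared unit differs, so the same instant is representable either way. A minimal stdlib sketch of that relationship, using a made-up epoch value:

```python
from datetime import datetime, timezone

# Both converted types store an int64 epoch offset; only the unit differs.
micros = 1590503815000000          # hypothetical updated_at, in microseconds
millis = micros // 1000            # the same instant annotated as milliseconds

as_micros = datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)
as_millis = datetime.fromtimestamp(millis / 1_000, tz=timezone.utc)
print(as_micros == as_millis)  # the two annotations denote the same instant
```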