rubenssoto opened a new issue #1878: URL: https://github.com/apache/hudi/issues/1878
Hi, how are you? I'm using EMR 5.30.1, Spark 2.4.5, and Hudi 0.5.2, and my data is stored in S3. Today I started migrating some of our production datasets to Apache Hudi, and I'm having problems with the first one. Could you help me, please?

It is a small dataset: 26 GB distributed across 89 Parquet files. I'm reading the data with Structured Streaming, 4 files per trigger. When I write the stream as regular Parquet it works, but with Hudi it doesn't. These are my Hudi options (I tried with and without the shuffle options; I need files larger than 500 MB, with a maximum of 1000 MB):

```python
hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'event_date',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.datasource.write.precombine.field': 'LineCreatedTimestamp',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.parquet.small.file.limit': 500000000,
    'hoodie.parquet.max.file.size': 800000000,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': tableName,
    'hoodie.datasource.hive_sync.database': 'datalake_raw',
    'hoodie.datasource.hive_sync.partition_fields': 'event_date',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.jdbcurl': 'jdbc:hive2://<EMR HOST>:10000',
    'hoodie.insert.shuffle.parallelism': 20,
    'hoodie.upsert.shuffle.parallelism': 20
}
```

My read and write functions:

```python
def read_parquet_stream(spark_session, read_folder_path, data_schema, max_files_per_trigger):
    spark = spark_session
    df = spark \
        .readStream \
        .option("maxFilesPerTrigger", max_files_per_trigger) \
        .schema(data_schema) \
        .parquet(read_folder_path)
    return df


def write_hudi_dataset_stream(spark_data_frame, checkpoint_location_folder, write_folder_path, hudi_options):
    df_write_query = spark_data_frame \
        .writeStream \
        .options(**hudi_options) \
        .trigger(processingTime='20 seconds') \
        .outputMode('append') \
        .format('hudi') \
        .option("checkpointLocation", checkpoint_location_folder) \
        .start(write_folder_path)
    df_write_query.awaitTermination()
```

I caught some errors:

```
Job aborted due to stage failure: Task 11 in stage 2.0 failed 4 times, most recent failure: Lost task 11.3 in stage 2.0 (TID 53, ip-10-0-87-171.us-west-2.compute.internal, executor 9): ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 6.3 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
```

My cluster is small, but so is the data:
- master: 4 cores and 16 GB RAM
- core nodes: 2 nodes, each with 4 cores and 16 GB RAM

Writing the stream as regular Parquet finishes in 38 minutes, but with Hudi more than an hour and a half has passed and the job still hasn't finished. Could you help me? I need to put this job in production as soon as possible. Thank you, guys!!!

<img width="1680" alt="Captura de Tela 2020-07-24 às 17 26 45" src="https://user-images.githubusercontent.com/36298331/88447482-27be5000-ce0a-11ea-889a-c5f1042fbe98.png">
<img width="1680" alt="Captura de Tela 2020-07-25 às 00 03 26" src="https://user-images.githubusercontent.com/36298331/88447495-4cb2c300-ce0a-11ea-9482-7965d7646476.png">
<img width="1680" alt="Captura de Tela 2020-07-25 às 00 05 00" src="https://user-images.githubusercontent.com/36298331/88447510-81267f00-ce0a-11ea-9311-38a395390d6b.png">
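For reference, the error message itself suggests boosting `spark.yarn.executor.memoryOverhead`. A minimal sketch of how that setting could be passed to `spark-submit` on an EMR 5.x / Spark 2.4 cluster follows; the `2g` and `4g` values and the script name are placeholders, not tuned recommendations:

```shell
# Sketch: raise the off-heap overhead that YARN accounts for, per the error's hint.
# On Spark 2.4 the key is spark.yarn.executor.memoryOverhead; values below are placeholders.
spark-submit \
  --conf spark.yarn.executor.memoryOverhead=2g \
  --conf spark.executor.memory=4g \
  my_hudi_streaming_job.py
```

Keeping executor memory plus overhead within each node's 16 GB budget is the constraint to watch on a cluster this size.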