Re: [I] [SUPPORT] Hudi Metadata Compaction is not happening [hudi]
danny0405 commented on issue #11535: URL: https://github.com/apache/hudi/issues/11535#issuecomment-2226621918 Yeah, the MDT delta_commit is archived based on its own strategy.
Re: [I] [SUPPORT] Hudi Metadata Compaction is not happening [hudi]
Jason-liujc commented on issue #11535: URL: https://github.com/apache/hudi/issues/11535#issuecomment-2226519310 Update: we tried cleaning the table synchronously instead of asynchronously, and we can see the compaction commit after the second run. It seems the first run resolved a lot of pending commits we had in the table: ``` Obtaining marker files for all created, merged paths Perform rollback actions: componentoutputs_discretionarycoop ``` and the second run logged: ``` Preparing compaction metadata: componentoutputs_discretionarycoop_metadata ```
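For context, a minimal PySpark sketch of what switching the cleaner from asynchronous to synchronous can look like on the writer; the DataFrame `df`, key fields, table name, and path below are placeholders rather than the reporter's actual settings, with `hoodie.clean.async` / `hoodie.clean.automatic` as the standard cleaner toggles:

```python
# Sketch: run cleaning synchronously as part of each write instead of async.
# `df`, the key fields and the base path are placeholders.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.clean.automatic": "true",   # clean as part of the commit
    "hoodie.clean.async": "false",      # turn off async cleaning
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3://my-bucket/hudi/my_table"))
```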
Re: [I] [SUPPORT]Hudi Failed to read MARKERS file [hudi]
danny0405 commented on issue #6900: URL: https://github.com/apache/hudi/issues/6900#issuecomment-2221758340 > Could not read commit details from hdfs://hacluster/user/kylin/flink/data/streaming_rdss_rcsp_lab/2024062815382133 Is this a real file on storage? Did you check the integrity of it?
Re: [I] [SUPPORT]Hudi Failed to read MARKERS file [hudi]
fanfanAlice commented on issue #6900: URL: https://github.com/apache/hudi/issues/6900#issuecomment-2219857140 Yes, we set `hoodie.embed.timeline.server=false`.
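For reference, a rough PySpark sketch of one way to pass that option on a Spark writer; the reporter's job may be using a different engine, and the DataFrame `df`, table name, and path are placeholders:

```python
# Sketch: disable the embedded timeline server on the writer, as mentioned above.
# `df`, the table name and the path are placeholders.
(df.write.format("hudi")
   .option("hoodie.table.name", "my_table")
   .option("hoodie.embed.timeline.server", "false")  # the setting referenced above
   .mode("append")
   .save("hdfs://namenode/warehouse/my_table"))
```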
Re: [I] [SUPPORT] hudi-common 0.14.0 jar in mavenCentral appears to have corrupt generated avro classes [hudi]
lucasmo commented on issue #11602: URL: https://github.com/apache/hudi/issues/11602#issuecomment-2218542973 https://github.com/apache/hudi/issues/11378 appears to be caused by this same issue
[I] [SUPPORT] hudi-common 0.14.0 jar in mavenCentral appears to have corrupt generated avro classes [hudi]
lucasmo opened a new issue, #11602: URL: https://github.com/apache/hudi/issues/11602 **Describe the problem you faced** When diagnosing a problem with XTable (see https://github.com/apache/incubator-xtable/issues/466), I noticed that avro classes were unable to even be instantiated for schema in a very simple test case when using `hudi-common-0.14.0` as a dependency. However, this issue does not exist when using `hudi-spark3.4-bundle_2.12-0.14.0` as a dependency, which contains the same avro autogenerated classes. A good specific example is `org/apache/hudi/avro/model/HoodieCleanPartitionMetadata.class`. When compiling hudi locally (tag `release-0.14.0`, `mvn clean package -DskipTests -Dspark3.4`, java 1.8), both generated jar files have the correct implementations of avro autogenerated classes. **To Reproduce** Steps to reproduce the behavior: 1. Download and uncompress hudi-spark3.4-bundle_2.12-0.14.0.jar and hudi-common-0.14.0.jar from mavencentral 2. Build Hudi locally 3. Run javap on `org/apache/hudi/avro/model/HoodieCleanPartitionMetadata.class` in all four of the jars 4. Note the file size of the text output of javap is 4232 for the file from every single jar aside from hudi-common, which has a javap text file size of 2323. OR run the following in Java 11, replacing $PATH_TO_A_HOODIE_AVRO_MODELS_JAR with a path to one of the four jar files ``` jshell --class-path ~/.m2/repository/org/apache/avro/avro/1.11.3/avro-1.11.3.jar:~/.m2/repository/com/fasterxml/jackson/core/jackson-core/2.17.1/jackson-core-2.17.1.jar:~/.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.17.1/jackson-databind-2.17.1.jar:~/.m2/repository/com/fasterxml/jackson/core/jackson-annotations/2.17.1/jackson-annotations-2.17.1.jar:~/.m2/repository/org/slf4j/slf4j-api/2.0.9/slf4j-api-2.0.9.jar:$PATH_TO_A_HOODIE_AVRO_MODELS_JAR ``` Then, copy and paste this into the shell: ``` org.apache.avro.Schema schema = new org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\"HoodieCleanPartitionMetadata\",\"namespace\":\"org.apache.hudi.avro.model\",\"fields\":[{\"name\":\"partitionPath\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}},{\"name\":\"policy\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}},{\"name\":\"deletePathPatterns\",\"type\":{\"type\":\"array\",\"items\":{\"type\":\"string\",\"avro.java.string\":\"String\"}}},{\"name\":\"successDeleteFiles\",\"type\":{\"type\":\"array\",\"items\":{\"type\":\"string\",\"avro.java.string\":\"String\"}}},{\"name\":\"failedDeleteFiles\",\"type\":{\"type\":\"array\",\"items\":{\"type\":\"string\",\"avro.java.string\":\"String\"}}},{\"name\":\"isPartitionDeleted\",\"type\":[\"null\",\"boolean\"],\"default\":null}]}"); System.out.println("Class for schema: " + org.apache.avro.specific.SpecificData.get().getClass(schema)); ``` On the MavenCentral hudi-common-0.14.0 jar, you should get: ``` | Exception java.lang.ExceptionInInitializerError |at Class.forName0 (Native Method) |at Class.forName (Class.java:398) ... | Caused by: java.lang.IllegalStateException: Recursive update |at ConcurrentHashMap.computeIfAbsent (ConcurrentHashMap.java:1760) ``` **Expected behavior** The above code snippet prints ``` Class for schema: class org.apache.hudi.avro.model.HoodieCleanPartitionMetadata ``` **Environment Description** * Hudi version : 0.14.0 everything else n/a, but duplicated issue on macOS and Ubuntu 22.04. -- This is an automated message from the Apache Git Service. 
Re: [I] [SUPPORT]Hudi Failed to read MARKERS file [hudi]
danny0405 commented on issue #6900: URL: https://github.com/apache/hudi/issues/6900#issuecomment-2217396181 > the dataset: [=-) Embedded timeline server is disabled Did you disable the embedded timeline server?
Re: [I] [SUPPORT] Hudi Metadata Compaction is not happening [hudi]
danny0405 commented on issue #11535: URL: https://github.com/apache/hudi/issues/11535#issuecomment-2215855839 @Jason-liujc Thanks for these tries. At a high level, we should definitely simplify the design of the MDT. As of 1.x, MDT compaction can work smoothly with any async table service, and the next step is to make it fully NB-CC (non-blocking concurrency control).
Re: [I] [SUPPORT] Hudi Metadata Compaction is not happening [hudi]
Jason-liujc commented on issue #11535: URL: https://github.com/apache/hudi/issues/11535#issuecomment-2214835602 Had an offline discussion with Shiyan. As long as the metadata table is not compacted properly, insertion performance will gradually get worse and worse. Here are the action items we are taking: 1. For future Hudi issues, we'll try to create GitHub issues first. I'll create another one for some incremental query errors (but it's totally mitigable on our end). 2. For this specific issue of the metadata table not being compacted, we'll try the following: a. Run scripts to delete previous uncommitted instants (and any files they created) and see if metadata compaction resumes. b. Run the workload with synchronous cleaning to see if it can compact the metadata table. c. After cleaning up pending commits, see if we can successfully reinitialize the metadata table. Will give an update here on how it goes.
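As a rough illustration of step 2a, a small Python sketch that lists instants on the data table's timeline that never completed; it assumes the `.hoodie/` folder is reachable as a local path (for S3 you would list the objects instead), uses a placeholder path, and only checks `commit`/`replacecommit` completions, so treat it as a starting point rather than a complete tool:

```python
import glob
import os

# Sketch: find timeline instants that have a .requested/.inflight file but no
# matching completed commit file. The path is a placeholder; adapt for S3/HDFS.
hoodie_dir = "/tmp/my_table/.hoodie"

pending = []
for f in glob.glob(os.path.join(hoodie_dir, "*.requested")) + \
         glob.glob(os.path.join(hoodie_dir, "*.inflight")):
    instant = os.path.basename(f).split(".")[0]
    completed = glob.glob(os.path.join(hoodie_dir, instant + ".commit")) + \
                glob.glob(os.path.join(hoodie_dir, instant + ".replacecommit"))
    if not completed:
        pending.append(os.path.basename(f))

print(sorted(pending))
```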
Re: [I] [SUPPORT] Hudi Metadata Compaction is not happening [hudi]
xushiyan commented on issue #11535: URL: https://github.com/apache/hudi/issues/11535#issuecomment-2211178125 > run compaction of the metadata table asynchronously There is no option to do that, as MT compaction is managed internally. > `hoodie.metadata.max.deltacommits.when_pending` parameter to say like 100 @Jason-liujc this is only a mitigation strategy. To get the MT to compact, you need to resolve the pending commit (let it finish or roll it back) on the data table's timeline. If you email us the zipped `.hoodie/`, we can help analyze it.
Re: [I] [SUPPORT] Hudi Metadata Compaction is not happening [hudi]
danny0405 commented on issue #11535: URL: https://github.com/apache/hudi/issues/11535#issuecomment-2208403502 Did you check whether the data table has a long-pending instant that never finished?
Re: [I] [SUPPORT] Hudi 0.12.1 support for Spark Structured Streaming. read clustering metadata replace avro file error. Unrecognized token 'Obj^A^B^Vavro' [hudi]
sdudi-te commented on issue #7375: URL: https://github.com/apache/hudi/issues/7375#issuecomment-2205762639 Is there a possible workaround for this? In other words, how do we recover from this situation? We are using Spark Structured Streaming on Kafka and write the output to Hudi on S3. After deleting the partial commit file (as a workaround), we observe that even though the streaming job is progressing with updated offsets, no data is ever written to Hudi.
Re: [I] [SUPPORT] Hudi Metadata Compaction is not happening [hudi]
Jason-liujc commented on issue #11535: URL: https://github.com/apache/hudi/issues/11535#issuecomment-2204845455 @danny0405 Ahh gotcha, we do have async cleaner that runs for our Hudi tables. @ad1happy2go I don't see any compaction on metadata table since a given date (I believe that's when we moved Hudi cleaning from sync to async, based on Danny's comment). When I delete the metadata and try to reinitialize I do see this error, which I believe they are the blocking instants: ``` 24/06/15 01:06:20 ip-10-0-157-87 WARN HoodieBackedTableMetadataWriter: Cannot initialize metadata table as operation(s) are in progress on the dataset: [[==>20240523221631416__commit__INFLIGHT__20240523224939000], [==>20240523225648799__commit__INFLIGHT__20240523232254000], [==>20240524111304660__commit__INFLIGHT__20240524142426000], [==>20240524235127638__commit__INFLIGHT__2024052500064], [==>20240525005114829__commit__INFLIGHT__20240525011802000], [==>20240525065356540__commit__INFLIGHT__20240525071004000], [==>20240525170219523__commit__INFLIGHT__20240525192315000], [==>20240527184608604__commit__INFLIGHT__20240527190327000], [==>20240528190417601__commit__INFLIGHT__20240528192418000], [==>20240529054718316__commit__INFLIGHT__20240529060542000], [==>20240530125710177__commit__INFLIGHT__20240531081522000], [==>20240530234238360__commit__INFLIGHT__20240530234726000], [==>20240531082713041__commit__REQUESTED__20240531082715000], [==>20240601164223688__commit__INFLIGHT__2024060 1190853000], [==>20240602072248313__commit__INFLIGHT__20240603005951000], [==>20240603010859993__commit__INFLIGHT__20240603100305000], [==>20240604043334594__commit__INFLIGHT__20240604061732000], [==>20240605061406367__commit__REQUESTED__20240605061412000], [==>20240605063936872__commit__REQUESTED__20240605063943000], [==>20240605071904045__commit__REQUESTED__2024060507191], [==>20240605074456040__commit__REQUESTED__20240605074502000], [==>20240605082437667__commit__REQUESTED__20240605082443000], [==>20240605085008272__commit__REQUESTED__20240605085014000], [==>20240605123632368__commit__REQUESTED__20240605123638000], [==>20240605130201503__commit__REQUESTED__20240605130207000], [==>20240605134213113__commit__REQUESTED__20240605134219000], [==>20240605140741158__commit__REQUESTED__20240605140747000], [==>20240605144756228__commit__REQUESTED__20240605144802000], [==>20240605151313557__commit__REQUESTED__20240605151319000], [==>20240605195405678__commit__REQUESTED__202406051954110 00], [==>20240605202017653__commit__REQUESTED__20240605202023000], [==>20240605205949232__commit__REQUESTED__20240605205955000], [==>20240605212536568__commit__REQUESTED__20240605212542000], [==>20240605220432089__commit__REQUESTED__20240605220438000], [==>20240606152537217__commit__INFLIGHT__20240607031027000], [==>20240606181110800__commit__INFLIGHT__2024060843000], [==>20240607112530977__commit__INFLIGHT__20240607212013000], [==>20240607213124841__commit__INFLIGHT__20240609024214000], [==>20240608001245366__commit__INFLIGHT__2024060904553], [==>20240609030620894__commit__INFLIGHT__2024060918031], [==>20240609181330488__commit__REQUESTED__20240609181336000], [==>20240609194304829__commit__INFLIGHT__20240611095337000], [==>20240611003906613__commit__INFLIGHT__20240611014341000], [==>20240611100258837__commit__INFLIGHT__20240612075536000], [==>20240611174425406__commit__INFLIGHT__20240611184626000], [==>20240612081821910__commit__INFLIGHT__20240612102427000], [==>2024061 2204659323__commit__REQUESTED__20240612204705000], 
[==>20240613044301243__commit__INFLIGHT__20240613075101000], [==>20240613085334404__commit__INFLIGHT__20240613105718000], [==>20240613113055212__commit__REQUESTED__20240613113101000], [==>20240613122745696__commit__REQUESTED__20240613122751000], [==>20240614094542418__commit__REQUESTED__20240614094548000], [==>20240614172456990__commit__REQUESTED__20240614172503000], [==>20240614175526954__commit__REQUESTED__20240614175529000], [==>20240614181441857__commit__REQUESTED__20240614181444000], [==>20240614222012190__commit__REQUESTED__20240614222015000], [==>20240614225952031__commit__REQUESTED__20240614225954000], [==>20240614235545094__commit__REQUESTED__20240614235547000]] ``` I guess my next questions are: 1. Is there a way to run compaction of the metadata table asynchronously, without cleaning up commits and deleting and recreating the metadata table? That process is a bit expensive, and based on what Danny said, metadata table compaction still won't work going forward. 2. Also, if we just increase the `hoodie.metadata.max.deltacommits.when_pending` parameter to, say, 100, what kind of performance hit should we expect? Is it mostly at the S3 file-listing level?
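For illustration, a minimal PySpark sketch of question 2, raising `hoodie.metadata.max.deltacommits.when_pending` on the writer; as noted elsewhere in this thread this is only a mitigation, and the DataFrame `df`, table name, key fields, and path are placeholders:

```python
# Sketch: raise the MDT delta-commit threshold as a temporary mitigation only;
# the real fix is resolving the pending commits on the data table's timeline.
mitigation_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.metadata.max.deltacommits.when_pending": "100",  # value from the question above
}

(df.write.format("hudi")
   .options(**mitigation_options)
   .mode("append")
   .save("s3://my-bucket/hudi/my_table"))
```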
Re: [I] [SUPPORT] Hudi Metadata Compaction is not happening [hudi]
danny0405 commented on issue #11535: URL: https://github.com/apache/hudi/issues/11535#issuecomment-2197789686 This is a known issue, probably because you have enabled an async table service on the data table. The 0.x Hudi metadata table does not work with async table services, which causes the MDT-not-compacting issue. It is fixed on master now, with our new completion-time-based file slicing and non-blocking style concurrency control.
Re: [I] [SUPPORT] Hudi insert job failed due to multiple files belonging to the same bucket id [hudi]
beyond1920 commented on issue #11527: URL: https://github.com/apache/hudi/issues/11527#issuecomment-2197044665 @danny0405 @dongtingting Good point. I think your analysis is reasonable. Generating the file ID in the driver could avoid different file group IDs for the same bucket ID, but it might cost too much memory in some cases. + @xushiyan WDYT?
Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]
Limess commented on issue #8114: URL: https://github.com/apache/hudi/issues/8114#issuecomment-2185835901 > @codope: As stated in the issue, the problem occurs every time. The version we are currently using is 0.14. @Limess: Have you not encountered this problem again? May I ask how it was avoided? Thanks! We never pursued this and are still on 0.13.0 for now, so I can't verify either way, sorry!
Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]
codope commented on issue #8114: URL: https://github.com/apache/hudi/issues/8114#issuecomment-2185495368 @zhaobangcai The full context is that the issue was fixed, but the fix also read the archived timeline, which caused very high sync latency, so the fix was reverted. Generally, reading the archived timeline is an anti-pattern in Hudi, and we are optimizing this by implementing the LSM timeline in 1.0.0. That said, I think we did fix the timeline loading in https://github.com/apache/hudi/commit/ab61f61df9686793406300c0018924a119b02855, which I believe is in 0.14. Can you please share a script/test case to reproduce the issue with all the configs you used in your env? I am going to reopen the issue based on your comment and debug further once you provide the script/test case. Thanks.
[I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]
Limess opened a new issue, #8114: URL: https://github.com/apache/hudi/issues/8114 **Describe the problem you faced** After running an insert to overwrite a Hudi table in place using `insert_overwrite_table`, partitions which no longer exist in the new input data are not removed by the Hive sync. This causes some query engines (e.g. AWS Athena) to fail until the old partitions are manually removed. This is on Hudi 0.12.1, but I'm fairly sure this issue still exists on 0.13.0 - this change: https://github.com/apache/hudi/pull/6662 fixes this behaviour for `delete_partition` operations, but doesn't add any handling for `insert_overwrite_table`. I'd be happy to be proven otherwise if this is fixed in 0.13.0 - I don't have an environment to easily test this without working out how to upgrade on EMR without a release. **To Reproduce** Steps to reproduce the behavior: 1. Create a new Hudi table using input data with two partitions, e.g. partition_col=1, partition_col=2 2. Insert into the table using the operation `hoodie.datasource.write.operation=insert_overwrite_table` with input data containing half of the original partitions, e.g. only partition_col=2 3. Run HiveSyncTool or similar (the stale partition is not removed with either the Spark writer sync or HiveSyncTool) 4. Check the Hive partitions. Both partitions still exist **Expected behavior** I'd expect the partition which was not inserted to be removed, e.g. only partition_col=2 exists and partition_col=1 is deleted. **Environment Description** * Hudi version : 0.12.1 * Spark version : 3.3.1 * Hive version : AWS Glue * Hadoop version : * Storage (HDFS/S3/GCS..) : S3 * Running on Docker? (yes/no) : no **Additional context** Running on EMR 0.6.9
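A rough PySpark sketch of the reproduction steps above; the bucket, column names, precombine field, and Hive sync options are placeholders and not necessarily the reporter's exact configuration:

```python
from pyspark.sql import SparkSession

# Sketch of the repro: write two partitions, then overwrite the whole table
# with data for only one of them, and check the synced partitions afterwards.
spark = SparkSession.builder.getOrCreate()

base = "s3://my-bucket/hudi/partition_drop_repro"
common = {
    "hoodie.table.name": "partition_drop_repro",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "partition_col",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
}

# 1. Create the table with two partitions.
df1 = spark.createDataFrame([(1, 1, 1), (2, 2, 1)], ["id", "partition_col", "ts"])
(df1.write.format("hudi").options(**common)
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .mode("overwrite").save(base))

# 2. Overwrite the whole table with data for only partition_col=2.
df2 = spark.createDataFrame([(3, 2, 2)], ["id", "partition_col", "ts"])
(df2.write.format("hudi").options(**common)
    .option("hoodie.datasource.write.operation", "insert_overwrite_table")
    .mode("append").save(base))

# 3. Inspect the Hive/Glue partitions: partition_col=1 should be dropped,
#    but per this issue it is still listed.
```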
Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]
zhaobangcai commented on issue #8114: URL: https://github.com/apache/hudi/issues/8114#issuecomment-2185478617 @codope: As stated in the issue, the problem occurs every time. The version we are currently using is 0.14. @Limess: Have you not encountered this problem again? May I ask how it was avoided? Thanks!
Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]
zhaobangcai commented on issue #8114: URL: https://github.com/apache/hudi/issues/8114#issuecomment-2185411734 @codope: Hello, this issue still exists in version 0.14; why was it closed?
[I] [SUPPORT] [hudi]
ashwinagalcha-ps opened a new issue, #11468: URL: https://github.com/apache/hudi/issues/11468 When using Kafka + Debezium + Streamer, we are able to write data and the job works fine, but when using the SqlQueryBasedTransformer, it is able to write data on S3 with the new field but ultimately the job fails. Below are the Hudi Deltastreamer job configs: ```"--table-type", "COPY_ON_WRITE", "--source-class", "org.apache.hudi.utilities.sources.debezium.PostgresDebeziumSource", "--transformer-class", "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer", "--hoodie-conf", "hoodie.streamer.transformer.sql=SELECT *, extract(year from a.created_at) as year FROM a", "--source-ordering-field", output["source_ordering_field"], "--target-base-path", f"s3a://{env_params['deltastreamer_bucket']}/{db_name}/{schema}/{output['table_name']}/", "--target-table", output["table_name"], "--auto.offset.reset=earliest "--props", properties_file, "--payload-class", "org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload", "--enable-hive-sync", "--hoodie-conf", "hoodie.datasource.hive_sync.mode=hms", "--hoodie-conf", "hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true", "--hoodie-conf", f"hoodie.deltastreamer.source.kafka.topic={connector_name}.{schema}.{output['table_name']}", "--hoodie-conf", f"schema.registry.url={env_params['schema_registry_url']}", "--hoodie-conf", f"hoodie.deltastreamer.schemaprovider.registry.url={env_params['schema_registry_url']}/subjects/{connector_name}.{schema}.{output['table_name']}-value/versions/latest", "--hoodie-conf", "hoodie.deltastreamer.source.kafka.value.deserializer.class=io.confluent.kafka.serializers.KafkaAvroDeserializer", "--hoodie-conf", "hoodie.datasource.hive_sync.use_jdbc=false", "--hoodie-conf", f"hoodie.datasource.hive_sync.database={output['hive_database']}", "--hoodie-conf", f"hoodie.datasource.hive_sync.table={output['table_name']}", "--hoodie-conf", "hoodie.datasource.hive_sync.metastore.uris=", "--hoodie-conf", "hoodie.datasource.hive_sync.enable=true", "--hoodie-conf", "hoodie.datasource.hive_sync.support_timestamp=true", "--hoodie-conf", "hoodie.deltastreamer.source.kafka.maxEvents=10", "--hoodie-conf", f"hoodie.datasource.write.recordkey.field={output['record_key']}", "--hoodie-conf", f"hoodie.datasource.write.precombine.field={output['precombine_field']}", "--hoodie-conf", f"hoodie.datasource.hive_sync.glue_database={output['hive_database']}", "--continuous"``` Properties file: ```bootstrap.servers= auto.offset.reset=earliest schema.registry.url=http://host:8081``` **Expected behavior**: To be able to extract a new field (year) in the target hudi table with the help of SqlQueryBasedTransformer. **Environment Description** * Hudi version : 0.14.0 * Spark version : 3.4.1 * Hadoop version : 3.3.4 * Storage (HDFS/S3/GCS..) : S3 * Running on Docker? 
(yes/no) : no * Base image & jars: `public.ecr.aws/ocean-spark/spark:platform-3.4.1-hadoop-3.3.4-java-11-scala-2.12-python-3.10-gen21` `https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.4-bundle_2.12/0.14.0/hudi-spark3.4-bundle_2.12-0.14.0.jar https://repo1.maven.org/maven2/org/apache/hudi/hudi-utilities-bundle_2.12/0.14.0/hudi-utilities-bundle_2.12-0.14.0.jar` **Stacktrace** ```2024-06-14T14:16:17.562738557Z 24/06/14 14:16:17 ERROR HoodieStreamer: Shutting down delta-sync due to exception 2024-06-14T14:16:17.562785897Z org.apache.hudi.utilities.exception.HoodieTransformExecutionException: Failed to apply sql query based transformer 2024-06-14T14:16:17.562797467Z at org.apache.hudi.utilities.transform.SqlQueryBasedTransformer.apply(SqlQueryBasedTransformer.java:68) 2024-06-14T14:16:17.562805097Z at org.apache.hudi.utilities.transform.ChainedTransformer.apply(ChainedTransformer.java:105) 2024-06-14T14:16:17.562812197Z at org.apache.hudi.utilities.streamer.StreamSync.lambda$fetchFromSource$0(StreamSync.java:530) 2024-06-14T14:16:17.562819517Z at org.apache.hudi.common.util.Option.map(Option.java:108) 2024-06-14T14:16:17.562826327Z at org.apache.hudi.utilities.streamer.StreamSync.fetchFromSource(StreamSync.java:530) 2024-06-14T14:16:17.562836838Z at org.apache.hudi.utilities.streamer.StreamSync.readFromSource(StreamSync.java:495) 2024-06-14T14:16:17.562844648Z at org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:405) 2024-06-14T14:16:17.562852958Z at org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.lambda$startService$1(HoodieStreamer.java:757) 2024-06-14T14:16:17.562860358Z at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) 2024-06-14T14:16:17.562868059Z at
Re: [I] [SUPPORT] [hudi]
codope closed issue #11431: [SUPPORT] URL: https://github.com/apache/hudi/issues/11431
Re: [I] [SUPPORT] [hudi]
ad1happy2go commented on issue #11431: URL: https://github.com/apache/hudi/issues/11431#issuecomment-2165415200 @zaminhassnain06 Closing this issue. Please reopen or create a new one for any further doubts on this.
Re: [I] [SUPPORT] [hudi]
zaminhassnain06 commented on issue #11431: URL: https://github.com/apache/hudi/issues/11431#issuecomment-2163145963 Thanks @ad1happy2go
Re: [I] [SUPPORT] [hudi]
ad1happy2go commented on issue #11431: URL: https://github.com/apache/hudi/issues/11431#issuecomment-2162943432 @zaminhassnain06 Correct, you need to rebuild.
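For what a rebuild might look like, a rough PySpark sketch that reads the existing table, casts the column, and bulk-inserts into a new base path; the paths are placeholders, the key/precombine fields are taken from the hoodie.properties shown later in this thread, and the exact options would need to match the original table:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Sketch: "rebuild" by rewriting the data with the widened column type into a
# new table. Paths are placeholders; drop Hudi meta columns before rewriting.
spark = SparkSession.builder.getOrCreate()

old_path = "s3://my-bucket/amz_hudi_vc_11_accounts"
new_path = "s3://my-bucket/amz_hudi_vc_11_accounts_v2"

meta_cols = ["_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
             "_hoodie_partition_path", "_hoodie_file_name"]

df = (spark.read.format("hudi").load(old_path)
      .drop(*meta_cols)
      .withColumn("id", col("id").cast("long")))

(df.write.format("hudi")
   .option("hoodie.table.name", "amz_hudi_vc_11_accounts_v2")
   .option("hoodie.datasource.write.recordkey.field", "id")
   .option("hoodie.datasource.write.precombine.field", "when_updated")
   .option("hoodie.datasource.write.operation", "bulk_insert")
   .mode("overwrite")
   .save(new_path))
```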
Re: [I] [SUPPORT] [hudi]
zaminhassnain06 commented on issue #11431: URL: https://github.com/apache/hudi/issues/11431#issuecomment-2162768973 @ad1happy2go Yes, the data types of our columns are changing, mostly from int to bigint, as our data is increasing. So in this scenario we should move directly to the higher version and rebuild our tables on the higher version, is that correct?
Re: [I] [SUPPORT] [hudi]
ad1happy2go commented on issue #11431: URL: https://github.com/apache/hudi/issues/11431#issuecomment-2162750491 @zaminhassnain06 Did the data type of your id column change? Why do you need to run the alter command? We can't change an integer field to long, as it's not backward compatible.
Re: [I] [SUPPORT] Hudi Application getting stuck when Async cleaner is spawned [hudi]
JuanAmayaBT commented on issue #7364: URL: https://github.com/apache/hudi/issues/7364#issuecomment-2161690376 any news on this? I am using **hudi 0.14.1 on aws glue** and getting from time to time the following error that seems to be related to this issue: Error waiting for async clean service to finish ``` spark_df.write.format('hudi').options(**hudi_final_settings).mode('Append').save() File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 966, in save self._jwrite.save() File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__ return_value = get_return_value( File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco return f(*a, **kw) File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value raise Py4JJavaError( py4j.protocol.Py4JJavaError: An error occurred while calling o169.save. : org.apache.hudi.exception.HoodieException: Error waiting for async clean service to finish at org.apache.hudi.async.AsyncCleanerService.waitForCompletion(AsyncCleanerService.java:77) at org.apache.hudi.client.BaseHoodieTableServiceClient.asyncClean(BaseHoodieTableServiceClient.java:133) at org.apache.hudi.client.BaseHoodieWriteClient.autoCleanOnCommit(BaseHoodieWriteClient.java:595) at org.apache.hudi.client.BaseHoodieWriteClient.mayBeCleanAndArchive(BaseHoodieWriteClient.java:579) at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:248) at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:104) at org.apache.hudi.HoodieSparkSqlWriterInternal.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:1081) at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:520) at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:204) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:121) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:103) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224) at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:114) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$7(SQLExecution.scala:139) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:139) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:245) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:138) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:100) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:96) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:615) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:177) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:615) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) at
[I] [SUPPORT] [hudi]
zaminhassnain06 opened a new issue, #11431: URL: https://github.com/apache/hudi/issues/11431 Hi, I tried to update my Hudi version from 0.6.0 to 0.11. I updated it gradually, version by version: from 0.6 to 0.9, then from 0.9 to 0.10, and finally from 0.10 to 0.11. I am running it on EMR and querying the Hudi table with Athena. The table version was updated correctly in the hoodie.properties file on S3 after each upgrade. However, when I tried to run an ALTER TABLE command to change the data type of a column from int to bigint on 0.11, it gave me the following error: `pyspark.sql.utils.AnalysisException: ALTER TABLE CHANGE COLUMN is not supported for changing column 'id' with type 'IntegerType' to 'id' with type 'LongType'` Do we have to rebuild the tables on the newer version directly? Following is my hoodie.properties file content: hoodie.table.timeline.timezone=LOCAL hoodie.table.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator hoodie.table.precombine.field=when_updated hoodie.table.version=4 hoodie.database.name= hoodie.datasource.write.hive_style_partitioning=true hoodie.table.checksum=2716619607 hoodie.partition.metafile.use.base.format=false hoodie.archivelog.folder=archived hoodie.table.name=amz_hudi_vc_11_accounts hoodie.populate.meta.fields=true hoodie.table.type=COPY_ON_WRITE hoodie.datasource.write.partitionpath.urlencode=false hoodie.table.base.file.format=PARQUET hoodie.datasource.write.drop.partition.columns=false hoodie.table.metadata.partitions=files hoodie.timeline.layout.version=1 hoodie.table.recordkey.fields=id hoodie.table.partition.fields= Following is the complete error: An error was encountered: ALTER TABLE CHANGE COLUMN is not supported for changing column 'id' with type 'IntegerType' to 'id' with type 'LongType' Traceback (most recent call last): File "/mnt/yarn/usercache/livy/appcache/application_1718084941070_0004/container_1718084941070_0004_01_01/pyspark.zip/pyspark/sql/session.py", line 723, in sql return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) File "/mnt/yarn/usercache/livy/appcache/application_1718084941070_0004/container_1718084941070_0004_01_01/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1322, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/mnt/yarn/usercache/livy/appcache/application_1718084941070_0004/container_1718084941070_0004_01_01/pyspark.zip/pyspark/sql/utils.py", line 117, in deco raise converted from None pyspark.sql.utils.AnalysisException: ALTER TABLE CHANGE COLUMN is not supported for changing column 'id' with type 'IntegerType' to 'id' with type 'LongType'
Re: [I] [SUPPORT]Hudi Deltastreamer compaction is taking longer duration [hudi]
ad1happy2go commented on issue #11273: URL: https://github.com/apache/hudi/issues/11273#issuecomment-2155139922 @SuneethaYamani https://hudi.apache.org/docs/configurations/#hoodiemetadataenable
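For illustration, a small sketch of passing that config to the Deltastreamer job, mirroring the Python argument-list style used elsewhere in this thread; only the `hoodie.metadata.enable=false` entry is the relevant part and the other values are placeholders:

```python
# Sketch: disable the metadata table for a Deltastreamer run via --hoodie-conf.
# The other arguments are placeholders; keep your existing ones and add the last pair.
arguments = [
    "--table-type", "COPY_ON_WRITE",
    "--target-table", "my_table",
    "--hoodie-conf", "hoodie.metadata.enable=false",
]
```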
Re: [I] [SUPPORT] Hudi's way of doing bucket index cannot be used to improve query engine queries such as join and filter [hudi]
KnightChess commented on issue #11204: URL: https://github.com/apache/hudi/issues/11204#issuecomment-2153754087 > @KnightChess do you have interest in pushing forward this feature? @danny0405 Yes, I will follow up on this.
Re: [I] [SUPPORT] Hudi's way of doing bucket index cannot be used to improve query engine queries such as join and filter [hudi]
danny0405 commented on issue #11204: URL: https://github.com/apache/hudi/issues/11204#issuecomment-2153650845 @KnightChess do you have interest in pushing forward this feature?
Re: [I] [SUPPORT] Hudi's way of doing bucket index cannot be used to improve query engine queries such as join and filter [hudi]
cono commented on issue #11204: URL: https://github.com/apache/hudi/issues/11204#issuecomment-2152342411 This is a really useful feature to have. We want to use Hudi at work, but unfortunately we have a couple of bucketed/sorted tables, and this is definitely a blocker for us to migrate to Hudi.
Re: [I] [SUPPORT] [hudi]
danny0405 commented on issue #11403: URL: https://github.com/apache/hudi/issues/11403#issuecomment-2151279640 I would suggest you use 0.12.3 or 0.14.1; 0.12.1 still has some stability issues.
[I] [SUPPORT] [hudi]
zaminhassnain06 opened a new issue, #11403: URL: https://github.com/apache/hudi/issues/11403 Hi, our organization is migrating from Hudi 0.6.0 to Hudi 0.12.1 and also updating the required Spark and EMR versions. Our existing data sets (100s of TBs of data on S3) were written using Hudi 0.6.0. Hudi has come a long way since 0.6.0, and we are not sure how to move to 0.12.1 directly. Could someone provide the steps for upgrading from 0.6.0 to 0.12.1? Do we have to rebuild our tables? We are especially concerned about this as the tables have billions of records. Should we expect the following improvements after the upgrade: – faster upserts – column add/modify (schema evolution) – clustering – a possible solution for storing the history of updates performed on records Thanks, Zamin Hassnain
Re: [I] [SUPPORT]Hudi Deltastreamer compaction is taking longer duration [hudi]
SuneethaYamani commented on issue #11273: URL: https://github.com/apache/hudi/issues/11273#issuecomment-2144490005 @ad1happy2go Can you please share the config to disable this? Temporarily, I changed hoodie.metadata.compact.max.delta.commits=365 to avoid this blocker. I am using the below config: arguments = [ "--table-type", table_type, "--op", op, "--enable-sync", "--source-ordering-field", source_ordering_field, "--source-class", "org.apache.hudi.utilities.sources.JsonDFSSource", "--target-table", table_name, "--target-base-path", hudi_target_path, "--payload-class", "org.apache.hudi.common.model.HoodieAvroPayload", "--transformer-class", "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer", "--props", props, "--schemaprovider-class", "org.apache.hudi.utilities.schema.FilebasedSchemaProvider", "--hoodie-conf", "hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator", "--hoodie-conf", "hoodie.datasource.write.recordkey.field={}".format(record_key), "--hoodie-conf", "hoodie.datasource.write.partitionpath.field={}".format(partition_field), "--hoodie-conf", "hoodie.streamer.source.dfs.root={}".format(delta_streamer_source), "--hoodie-conf", "hoodie.datasource.write.precombine.field={}".format(precombine), "--hoodie-conf", "hoodie.database.name={}".format(glue_db), "--hoodie-conf", "hoodie.datasource.hive_sync.enable=true", "--hoodie-conf", "hoodie.metadata.record.index.enable=true", "--hoodie-conf", "hoodie.datasource.insert.dup.policy=true", "--hoodie-conf", "hoodie.table.cdc.enabled=true", "--hoodie-conf", "hoodie.index.type=RECORD_INDEX", "--hoodie-conf", "hoodie.datasource.hive_sync.table={}".format(table_name), "--hoodie-conf", "hoodie.datasource.hive_sync.partition_fields={}".format(partition_field), "--hoodie-conf", "hoodie.datasource.schema.avro.path={}".format(schema_path), "--hoodie-conf", "hoodie.datasource.schema.strategy=UNION", "--hoodie-conf", "hoodie.streamer.transformer.sql={}".format(sql), "--hoodie-conf", "hoodie.streamer.schemaprovider.source.schema.file={}".format(schema_path), "--hoodie-conf", "hoodie.streamer.schemaproider.target.schema.file={}".format(schema_path), "--hoodie-conf", "hoodie.metadata.compact.max.delta.commits=365" ]
Re: [I] [SUPPORT]Hudi Deltastreamer compaction is taking longer duration [hudi]
ad1happy2go commented on issue #11273: URL: https://github.com/apache/hudi/issues/11273#issuecomment-2142552863 @SuneethaYamani The metadata table helps you reduce file-listing API calls. You can disable it if it is the bottleneck, although we'd want to understand why it's taking so long. Can you share the writer configs?
Re: [I] [SUPPORT] Hudi Sink Connector shows broker disconnected [hudi]
prabodh1194 commented on issue #9070: URL: https://github.com/apache/hudi/issues/9070#issuecomment-2139020981 But I'm still facing a bunch of issues with the Java classpath.
Re: [I] [SUPPORT] Hudi Sink Connector shows broker disconnected [hudi]
prabodh1194 commented on issue #9070: URL: https://github.com/apache/hudi/issues/9070#issuecomment-2138022231 Yeah, I just wanted to check out Kafka Connect and got massively stuck on this issue :( Anyway, I think prefixing the props with `consumer.override` works well.
Re: [I] [SUPPORT] Hudi Sink Connector shows broker disconnected [hudi]
soumilshah1995 commented on issue #9070: URL: https://github.com/apache/hudi/issues/9070#issuecomment-2137973021 Why not use Deltastreamer instead?
Re: [I] [SUPPORT] Hudi Sink Connector shows broker disconnected [hudi]
prabodh1194 commented on issue #9070: URL: https://github.com/apache/hudi/issues/9070#issuecomment-2137909921 I am facing the same issue. I have searched around everywhere; what am I missing?
Re: [I] [SUPPORT]Hudi Deltastreamer compaction is taking longer duration [hudi]
SuneethaYamani commented on issue #11273: URL: https://github.com/apache/hudi/issues/11273#issuecomment-2134603908 @ad1happy2go Yes, it is for the metadata table.
Re: [I] [SUPPORT]Hudi Deltastreamer compaction is taking longer duration [hudi]
ad1happy2go commented on issue #11273: URL: https://github.com/apache/hudi/issues/11273#issuecomment-2134341804 @SuneethaYamani That's not possible. Can you share the configs? One possibility is that the compaction you are seeing is not for your main table; it may be for the metadata table, which is MOR by design. Can you confirm whether it's the metadata table? You can also try disabling the metadata table.
[I] [SUPPORT] [hudi]
Pavan792reddy opened a new issue, #11275: URL: https://github.com/apache/hudi/issues/11275 **Environment Description** * Hudi version : 0.14 * Spark version : 3.3.2 * Hive version : * Hadoop version : * Storage (HDFS/S3/GCS..) : GCS * Running on Docker? (yes/no) : NO **Stacktrace** spark-submit --master 'local[*]' --deploy-mode client --packages 'org.apache.hudi:hudi-spark3.1-bundle_2.12:0.14.1,io.streamnative.connectors:pulsar-spark-connector_2.12:3.2.0.2' --repositories https://repo.maven.apache.org/maven2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --jars '/home/pavankumar_reddy/hudi-spark3.1-bundle_2.12-0.14.1.jar,/home/pavankumar_reddy/hudi-utilities_2.12-0.14.1.jar' --class org.apache.hudi.utilities.streamer.HoodieStreamer ls /home/pavankumar_reddy/hudi-utilities-slim-bundle_2.12-0.14.1.jar --source-class org.apache.hudi.utilities.sources.PulsarSource --source-ordering-field when --target-base-path gs://pulsarstreamer-test/hudi_data/avroschema_stream --target-table avroschema_stre --hoodie-conf hoodie.datasource.write.recordkey.field=id --hoodie-conf hoodie.datasource.write.partitionpath.field=id --table-type COPY_ON_WRITE --op UPSERT --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator --hoodie-conf hoodie.streamer.source.pulsar.topic=persistent://mytenant/mynamespace/avroschema --hoodie-conf hoodie.streamer.source.pulsar.endpoint.service.url=pulsar://localhost:6650 --hoodie-conf hoodie.streamer.source.pulsar.endpoint.admin.url=pulsar://localhost:8080 --continuous Error:- 24/05/22 13:00:51 INFO org.apache.pulsar.client.impl.ConnectionPool: [[id: 0x6e32bfc9, L:/10.128.0.70:55298 - R:10.128.0.40/10.128.0.40:6650]] Connected to server 24/05/22 13:00:51 INFO org.apache.pulsar.client.impl.ClientCnx: [id: 0x6e32bfc9, L:/10.128.0.70:55298 - R:10.128.0.40/10.128.0.40:6650] Connected through proxy to target broker at localhost:6650 24/05/22 13:00:51 INFO org.apache.pulsar.client.impl.ConsumerImpl: [persistent://mytenant/mynamespace/avroschema][spark-pulsar-batch-97273cbf-ccc7-4e63-9c0c-60642c1ff1ed-persistent://mytenant/mynamespace/avroschema] Subscribing to topic on cnx [id: 0x6e32bfc9, L:/10.128.0.70:55298 - R:10.128.0.40/10.128.0.40:6650], consumerId 0 24/05/22 13:00:51 INFO org.apache.pulsar.client.impl.ConsumerImpl: [persistent://mytenant/mynamespace/avroschema][spark-pulsar-batch-97273cbf-ccc7-4e63-9c0c-60642c1ff1ed-persistent://mytenant/mynamespace/avroschema] Subscribed to topic on 10.128.0.40/10.128.0.40:6650 -- consumer: 0 24/05/22 13:00:51 ERROR org.apache.hudi.utilities.streamer.HoodieStreamer: Shutting down delta-sync due to exception java.lang.UnsupportedOperationException: MessageId is
null at org.apache.pulsar.client.impl.MessageIdImpl.compareTo(MessageIdImpl.java:214) at org.apache.pulsar.client.impl.MessageIdImpl.compareTo(MessageIdImpl.java:32) at org.apache.pulsar.client.impl.ConsumerImpl.hasMoreMessages(ConsumerImpl.java:2291) at org.apache.pulsar.client.impl.ConsumerImpl.hasMessageAvailableAsync(ConsumerImpl.java:2237) at org.apache.pulsar.client.impl.ConsumerImpl.hasMessageAvailable(ConsumerImpl.java:2181) at org.apache.spark.sql.pulsar.PulsarHelper.getUserProvidedMessageId(PulsarHelper.scala:451) at org.apache.spark.sql.pulsar.PulsarHelper.$anonfun$fetchCurrentOffsets$1(PulsarHelper.scala:415) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.immutable.Map$Map1.foreach(Map.scala:193) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.pulsar.PulsarHelper.fetchCurrentOffsets(PulsarHelper.scala:408) at
[I] [SUPPORT]Hudi Deltastreamer compaction is taking longer duration [hudi]
SuneethaYamani opened a new issue, #11273: URL: https://github.com/apache/hudi/issues/11273 Hi, I am creating a COW table. I want to run compaction separately instead of along with my write operation, so I used hoodie.datasource.write.streaming.disable.compaction=true. Still, compaction is getting triggered. Usually the data write finishes in 2 minutes, but whenever compaction is triggered the jobs stay stuck in the running state. Thanks, Suneetha
Re: [I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]
soumilshah1995 closed issue #11258: [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input URL: https://github.com/apache/hudi/issues/11258
Re: [I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]
soumilshah1995 commented on issue #11258: URL: https://github.com/apache/hudi/issues/11258#issuecomment-2125096672 Thanks man -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]
soumilshah1995 commented on issue #11258: URL: https://github.com/apache/hudi/issues/11258#issuecomment-2125075068 really let me try -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]
ad1happy2go commented on issue #11258: URL: https://github.com/apache/hudi/issues/11258#issuecomment-2123907590 @soumilshah1995 Your transformer class should be --transformer-class org.apache.hudi.utilities.transform.SqlFileBasedTransformer -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]
soumilshah1995 commented on issue #11258: URL: https://github.com/apache/hudi/issues/11258#issuecomment-2119296530 When providing a SQL file as input ``` java.lang.IllegalArgumentException: Property hoodie.streamer.transformer.sql not found ``` it looks like it is still looking for hoodie.streamer.transformer.sql -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]
soumilshah1995 opened a new issue, #11258: URL: https://github.com/apache/hudi/issues/11258 Here is Delta Streamer ``` spark-submit \ --class org.apache.hudi.utilities.streamer.HoodieStreamer \ --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0 \ --properties-file spark-config.properties \ --master 'local[*]' \ --executor-memory 1g \ /Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \ --table-type COPY_ON_WRITE \ --op UPSERT \ --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \ --source-ordering-field replicadmstimestamp \ --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \ --target-base-path file:///Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E1/silver/ \ --target-table invoice \ --props hudi_tbl.props ``` # Hudi prop ``` hoodie.streamer.source.hoodieincr.missing.checkpoint.strategy=READ_UPTO_LATEST_COMMIT hoodie.streamer.source.hoodieincr.path=s3a://warehouse/default/table_name=orders hoodie.datasource.write.recordkey.field=order_id hoodie.datasource.write.partitionpath.field= hoodie.datasource.write.precombine.field=ts ``` Tried following options ``` hoodie.streamer.transformer.sql.file=join.sql OR hoodie.streamer.transformer.sql.file=file:///Users/soumilshah/IdeaProjects/SparkProject/deltastreamerBroadcastJoins/join.sql OR hoodie.streamer.transformer.sql.file=/Users/soumilshah/IdeaProjects/SparkProject/deltastreamerBroadcastJoins/join.sql ``` # Error Message ``` FO BaseHoodieTableFileIndex: Refresh table orders, spent: 15 ms 24/05/19 12:38:37 ERROR HoodieStreamer: Shutting down delta-sync due to exception java.lang.IllegalArgumentException: Property hoodie.streamer.transformer.sql not found at org.apache.hudi.common.util.ConfigUtils.getStringWithAltKeys(ConfigUtils.java:334) at org.apache.hudi.common.util.ConfigUtils.getStringWithAltKeys(ConfigUtils.java:308) at org.apache.hudi.utilities.transform.SqlQueryBasedTransformer.apply(SqlQueryBasedTransformer.java:52) at org.apache.hudi.utilities.transform.ChainedTransformer.apply(ChainedTransformer.java:105) at org.apache.hudi.utilities.streamer.StreamSync.lambda$fetchFromSource$0(StreamSync.java:530) at org.apache.hudi.common.util.Option.map(Option.java:108) at org.apache.hudi.utilities.streamer.StreamSync.fetchFromSource(StreamSync.java:530) at org.apache.hudi.utilities.streamer.StreamSync.readFromSource(StreamSync.java:495) at org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:405) at org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.lambda$startService$1(HoodieStreamer.java:757) at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) 24/05/19 12:38:37 INFO HoodieStreamer: Delta Sync shutdown. Error ?true 24/05/19 12:38:37 INFO HoodieStreamer: Ingestion completed. 
Has error: true 24/05/19 12:38:37 INFO StreamSync: Shutting down embedded timeline server 24/05/19 12:38:37 ERROR HoodieAsyncService: Service shutdown with error java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: Property hoodie.streamer.transformer.sql not found at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2005) at org.apache.hudi.async.HoodieAsyncService.waitForShutdown(HoodieAsyncService.java:103) at org.apache.hudi.utilities.ingestion.HoodieIngestionService.startIngestion(HoodieIngestionService.java:65) at org.apache.hudi.common.util.Option.ifPresent(Option.java:97) at org.apache.hudi.utilities.streamer.HoodieStreamer.sync(HoodieStreamer.java:205) at org.apache.hudi.utilities.streamer.HoodieStreamer.main(HoodieStreamer.java:584) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at
[I] [SUPPORT] Hudi COW Encryptions [hudi]
soumilshah1995 opened a new issue, #11257: URL: https://github.com/apache/hudi/issues/11257 # Sample Code ``` try: import os import sys import uuid import pyspark import datetime from pyspark.sql import SparkSession from pyspark import SparkConf, SparkContext from faker import Faker import datetime from datetime import datetime import random import pandas as pd # Import Pandas library for pretty printing print("Imports loaded ") except Exception as e: print("error", e) HUDI_VERSION = '1.0.0-beta1' SPARK_VERSION = '3.4' os.environ["JAVA_HOME"] = "/opt/homebrew/opt/openjdk@11" SUBMIT_ARGS = f"--packages org.apache.hudi:hudi-spark{SPARK_VERSION}-bundle_2.12:{HUDI_VERSION} pyspark-shell" os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS os.environ['PYSPARK_PYTHON'] = sys.executable # Spark session spark = SparkSession.builder \ .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \ .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension') \ .config('className', 'org.apache.hudi') \ .config('spark.sql.hive.convertMetastoreParquet', 'false') \ .getOrCreate() spark._jsc.hadoopConfiguration().set("parquet.crypto.factory.class", "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory") spark._jsc.hadoopConfiguration().set("parquet.encryption.kms.client.class" , "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS") spark._jsc.hadoopConfiguration().set("parquet.encryption.footer.key", "k1") spark._jsc.hadoopConfiguration().set("parquet.encryption.column.keys", "k2:customer_id") global faker faker = Faker() def get_customer_data(total_customers=2): customers_array = [] for i in range(0, total_customers): customer_data = { "customer_id": str(uuid.uuid4()), "name": faker.name(), "state": faker.state(), "city": faker.city(), "email": faker.email(), "created_at": datetime.now().isoformat().__str__(), "adqdress": faker.address(), "salary": faker.random_int(min=3, max=10) } customers_array.append(customer_data) return customers_array global total_customers, order_data_sample_size total_customers = 1 customer_data = get_customer_data(total_customers=total_customers) spark_df_customers = spark.createDataFrame(data=[tuple(i.values()) for i in customer_data], schema=list(customer_data[0].keys())) spark_df_customers.show(1, truncate=False) spark_df_customers.printSchema() def write_to_hudi(spark_df, table_name, db_name, method='upsert', table_type='COPY_ON_WRITE', recordkey='', precombine='', partition_fields='', index_type='BLOOM' ): path = f"file:///Users/soumilshah/IdeaProjects/SparkProject/tem/database={db_name}/table_name{table_name}" hudi_options = { 'hoodie.table.name': table_name, 'hoodie.datasource.write.table.type': table_type, 'hoodie.datasource.write.table.name': table_name, 'hoodie.datasource.write.operation': method, 'hoodie.datasource.write.recordkey.field': recordkey, 'hoodie.datasource.write.precombine.field': precombine, "hoodie.datasource.write.partitionpath.field": partition_fields, "hoodie.index.type": index_type, } if index_type == 'RECORD_INDEX': hudi_options.update({ "hoodie.enable.data.skipping": "true", "hoodie.metadata.enable": "true", "hoodie.metadata.index.column.stats.enable": "true", "hoodie.write.concurrency.mode": "optimistic_concurrency_control", "hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.InProcessLockProvider", "hoodie.metadata.record.index.enable": "true" }) print("\n") print(path) print("\n") spark_df.write.format("hudi"). \ options(**hudi_options). \ mode("append"). 
\ save(path) write_to_hudi( spark_df=spark_df_customers, db_name="default", table_name="customers", recordkey="customer_id", precombine="created_at", partition_fields="state", index_type="BLOOM" ) ``` # Error ``` 24/05/19 11:01:48 ERROR SimpleExecutor: Failed consuming records org.apache.parquet.crypto.ParquetCryptoRuntimeException:
Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]
ad1happy2go commented on issue #11170: URL: https://github.com/apache/hudi/issues/11170#issuecomment-2117238692 Thanks @matthijseikelenboom for the update -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]
matthijseikelenboom closed issue #11170: [SUPPORT] Hudi fails ACID verification test URL: https://github.com/apache/hudi/issues/11170 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]
matthijseikelenboom commented on issue #11170: URL: https://github.com/apache/hudi/issues/11170#issuecomment-2117016086 Tested and verified. Closing the issue. More info: the solution has been tested on: - Java 8 ✅ - Java 11 ✅ - Java 17 ❌ (as of this moment, Hudi doesn't support this version) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]
ad1happy2go commented on issue #11170: URL: https://github.com/apache/hudi/issues/11170#issuecomment-2115400503 @matthijseikelenboom I was able to test successfully. There were two issues: 1. InProcessLockProvider doesn't work for multiple writers, so use FileSystemBasedLockProvider in TransactionWriter.java ``` dataSet.write().format("hudi") .option("hoodie.table.name", tableName) .option("hoodie.datasource.write.recordkey.field", "primaryKeyValue") .option("hoodie.datasource.write.partitionpath.field", "partitionKeyValue") .option("hoodie.datasource.write.precombine.field", "dataValue") .option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider") .mode(SaveMode.Append) .save(tablePath); ``` 2. Along with the refresh, also run an MSCK repair in ReaderThread so newly added partitions are picked up. ``` session.sql("REFRESH TABLE " + fullyQualifiedTableName); session.sql("MSCK REPAIR TABLE " + fullyQualifiedTableName); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]
ad1happy2go commented on issue #11170: URL: https://github.com/apache/hudi/issues/11170#issuecomment-2115401719 @matthijseikelenboom Please let us know if it works for you also. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]
ad1happy2go commented on issue #11170: URL: https://github.com/apache/hudi/issues/11170#issuecomment-2115007277 @matthijseikelenboom I tried to run it locally but am again seeing issues. We can connect once. If you are on the Apache Hudi Slack, can you ping me ("Aditya Goenka")? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi 0.14.0 - deletion from table failing for org.apache.hudi.keygen.TimestampBasedKeyGenerator [hudi]
Priyanka128 commented on issue #10823: URL: https://github.com/apache/hudi/issues/10823#issuecomment-2111792611 > I think your timestamp.type should be "DATE_STRING". Tried this but got the exception below: _Caused by: java.lang.RuntimeException: hoodie.keygen.timebased.timestamp.scalar.time.unit is not specified but scalar it supplied as time value at org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.convertLongTimeToMillis(TimestampBasedAvroKeyGenerator.java:216) at org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:187) at org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:118) ... 18 more_ After encountering this exception, I removed "hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit" -> "DAYS", but the same exception "Caused by: java.lang.RuntimeException: hoodie.keygen.timebased.timestamp.scalar.time.unit is not specified but scalar it supplied as time value" kept coming. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
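For illustration only (this is not a confirmed fix from the thread): with the timestamp-based key generator, the DATE_STRING type is normally paired with input/output date formats, while the scalar time unit named in the quoted exception belongs to the SCALAR type. A minimal PySpark sketch of what the DATE_STRING variant might look like; the table path, field names, and formats are placeholder assumptions, and the Hudi Spark bundle is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("timestamp-keygen-sketch").getOrCreate()

# Placeholder data; only the shape matters for the sketch.
df = spark.createDataFrame([("1", "2024-05-01")], ["id", "event_date"])

hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "event_date",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.TimestampBasedKeyGenerator",
    # DATE_STRING is driven by the input/output formats below; the
    # *.scalar.time.unit option from the exception applies to the SCALAR type.
    "hoodie.keygen.timebased.timestamp.type": "DATE_STRING",
    "hoodie.keygen.timebased.input.dateformat": "yyyy-MM-dd",
    "hoodie.keygen.timebased.output.dateformat": "yyyy/MM/dd",
}

df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi/events")
```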
Re: [I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]
ziudu commented on issue #11204: URL: https://github.com/apache/hudi/issues/11204#issuecomment-2109284886 I'm a newbie. It took me a while to understand why bucket join does not work. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]
danny0405 commented on issue #11204: URL: https://github.com/apache/hudi/issues/11204#issuecomment-2109221161 > So if we have to choose between Spark and Hive, I think Spark might be the higher priority I agree. Do you have the energy to complete that suspended PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]
ziudu commented on issue #11204: URL: https://github.com/apache/hudi/issues/11204#issuecomment-2109160408 Hi danny0405, I think support for a Spark sort-merge join between two Hudi tables with bucket optimization is an important feature. Currently, if we join two Hudi tables, the bucket index's bucket information is not usable by Spark, so a shuffle is always needed. As explained in [8657](https://github.com/apache/hudi/pull/8657), the hashing, file naming, file numbering, and file sorting are all different. Unfortunately, according to https://issues.apache.org/jira/browse/SPARK-19256, Spark bucketing is not compatible with Hive bucketing yet. So if we have to choose between Spark and Hive, I think Spark might be the higher priority. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]
ziudu opened a new issue, #11204: URL: https://github.com/apache/hudi/issues/11204 According to parisni in [HUDI-6150] Support bucketing for each hive client (https://github.com/apache/hudi/pull/8657): "So I assume the Hudi way of doing it (which is not compliant with either Hive or Spark) cannot be used to improve query engine queries such as join and filter. Then this leads to all of the below being wrong: the current config (https://hudi.apache.org/docs/configurations/#hoodiedatasourcehive_syncbucket_sync), this current PR, and the RFC statement about support of Hive bucketing (https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index)." Do you have any update on this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi could override users' configurations [hudi]
boneanxs commented on issue #11188: URL: https://github.com/apache/hudi/issues/11188#issuecomment-2105500024 > > I actually see Hudi can set many Spark-related configs in SparkConf, most of them related to the Parquet reader/writer. > > Are these options configurable? Yes, these configs can be set by users -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi could override users' configurations [hudi]
danny0405 commented on issue #11188: URL: https://github.com/apache/hudi/issues/11188#issuecomment-2105384268 > I actually see Hudi can set many Spark-related configs in SparkConf, most of them related to the Parquet reader/writer. Are these options configurable? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [SUPPORT] Hudi could override users' configurations [hudi]
boneanxs opened a new issue, #11188: URL: https://github.com/apache/hudi/issues/11188 We recently also hit the issue https://github.com/apache/hudi/issues/9305, but with a different cause (we still use Hudi 0.12). The user manually set `spark.sql.parquet.enableVectorizedReader` to false, then read a Hive table and cached it. Since Spark analyzes the plan first if it needs to be cached, Spark won't add `C2R` to that cached plan because the vectorized reader is false. At this point, Spark won't execute that plan since there's no action operator. Then the user reads a MOR read_optimized table, joins it with that cached plan, and collects the result. Because the MOR table automatically flips `enableVectorizedReader` back to true, the Hive table is actually read as a columnar batch, but the plan doesn't contain `C2R` to convert the batch to rows, so the error occurs: ![Screenshot 2024-05-10 at 18 32 22](https://github.com/apache/hudi/assets/10115332/14b387e0-ecee-4c04-9aff-ba024ce3af55) ```java java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to org.apache.spark.sql.catalyst.InternalRow at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.hasNext(InMemoryRelation.scala:118) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:302) at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1481) at ``` ```scala override def imbueConfigs(sqlContext: SQLContext): Unit = { super.imbueConfigs(sqlContext) sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.enableVectorizedReader", "true") } ``` I see there is some modification in the master code, but I suspect this issue could still happen since we also modify it in `HoodieFileGroupReaderBasedParquetFileFormat`: ```scala spark.conf.set("spark.sql.parquet.enableVectorizedReader", supportBatchResult) ``` Besides this issue, is it suitable to set Spark configs globally? Whether users set them or not, I actually see Hudi can set many Spark-related configs in `SparkConf`, most of them related to the Parquet reader/writer. This could confuse users and make it hard for devs to find the cause. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
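Purely as an illustrative check (not code from the thread): the session-level override described above can be observed by reading the conf before and after a read-optimized query on a MOR table. The table path is a placeholder, and whether the value actually flips depends on the Hudi version in use.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("vectorized-reader-check").getOrCreate()
path = "/tmp/hudi/some_mor_table"  # placeholder MOR table path

# User explicitly disables the vectorized Parquet reader for this session.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
print(spark.conf.get("spark.sql.parquet.enableVectorizedReader"))  # "false"

# Read the MOR table in read_optimized mode; per the report, Hudi may set the
# session conf back to "true" while building its relation.
df = (spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "read_optimized")
      .load(path))
df.count()

print(spark.conf.get("spark.sql.parquet.enableVectorizedReader"))  # may now be "true"
```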
Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]
matthijseikelenboom commented on issue #11170: URL: https://github.com/apache/hudi/issues/11170#issuecomment-2102747936 @ad1happy2go I've pushed a new branch on the repo where the project is downgraded to Java 8. When running the test now, the writers don't seem to fail anymore, but it still fails the verification test. https://github.com/apache/hudi/assets/1364843/30384d79-6905-4c2e-96f0-e246d5589469 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi Record Index not working as Expected: gives warning as "WARN SparkMetadataTableRecordIndex: Record index not initialized so falling back to GLOBAL_SIMPLE for tagging records" [hudi]
zeeshan-media closed issue #10507: [SUPPORT] Hudi Record Index not working as Expected: gives warning as "WARN SparkMetadataTableRecordIndex: Record index not initialized so falling back to GLOBAL_SIMPLE for tagging records" URL: https://github.com/apache/hudi/issues/10507 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]
matthijseikelenboom commented on issue #11170: URL: https://github.com/apache/hudi/issues/11170#issuecomment-2102603694 Okay, yeah sure. The original test was written with Java 11, but I updated to 17 because I thought why not and Spark 3.4.2 supports it. Is it known that Hudi (Or Kryo) also doesn't work with Java 11 and is that why you suggest Java 8? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]
ad1happy2go commented on issue #11170: URL: https://github.com/apache/hudi/issues/11170#issuecomment-2102355732 @matthijseikelenboom I noticed you are using JAVA 17 for the same. Hudi 0.14.1 doesn't support JAVA 17 yet. The newer Hudi version will be able to support the same. Some reference to similar issue related to java 17 here - https://github.com/EsotericSoftware/kryo/issues/885 Can you try with JAVA 8 once. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]
ad1happy2go commented on issue #11170: URL: https://github.com/apache/hudi/issues/11170#issuecomment-2101987758 @matthijseikelenboom Looks like some library conflicts are there in the project. Need to reproduce it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]
matthijseikelenboom commented on issue #11170: URL: https://github.com/apache/hudi/issues/11170#issuecomment-2100205976 @ad1happy2go Ah yes, you're right. I seem to have forgot to add the hudi-defaults.conf file to this project. I've added it to my repository and ran the test again. It comes further along, but still breaks down. Stacktrace (Be warned, it's a big one): ``` ERROR! : Failed to upsert for commit time 20240508114518478 24/05/08 11:45:18 ERROR TransactionWriter: Exception in writer. java.lang.RuntimeException: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20240508114518478 at org.example.writer.TransactionWriter.wrapOrRethrowException(TransactionWriter.java:192) at org.example.writer.TransactionWriter.tryTransaction(TransactionWriter.java:184) at org.example.writer.TransactionWriter.updateTransaction(TransactionWriter.java:143) at org.example.writer.TransactionWriter.lambda$handleTransaction$0(TransactionWriter.java:89) at org.example.writer.TransactionWriter.withRetryOnException(TransactionWriter.java:109) at org.example.writer.TransactionWriter.handleTransaction(TransactionWriter.java:83) at org.example.writer.TransactionWriter.run(TransactionWriter.java:70) Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20240508114518478 at org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:70) at org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:44) at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:114) at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:103) at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:142) at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:224) at org.apache.hudi.HoodieSparkSqlWriterInternal.liftedTree1$1(HoodieSparkSqlWriter.scala:504) at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:502) at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:204) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:121) at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.executeUpsert(MergeIntoHoodieTableCommand.scala:439) at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:282) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488) at
Re: [I] [SUPPORT] Hudi fails ACID verification test [hudi]
ad1happy2go commented on issue #11170: URL: https://github.com/apache/hudi/issues/11170#issuecomment-2100011078 @matthijseikelenboom I don't see any lock-related configurations in your setup. I checked that you are using 2 parallel writers, so you may need to configure a lock provider for writes. Hudi follows the OCC (optimistic concurrency control) principle. Check the multi-writer setup here - https://hudi.apache.org/docs/concurrency_control/#model-c-multi-writer Let me know in case I am missing anything. Thanks a lot. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
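For reference, a hedged PySpark sketch of the kind of multi-writer (OCC) options the linked concurrency-control page describes. The lock provider choice, table path, and sample data are illustrative, not the reporter's actual setup; the field names follow the snippet quoted earlier in this thread.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("occ-sketch").getOrCreate()

# Placeholder data matching the record/partition/precombine fields used in the thread.
df = spark.createDataFrame(
    [("k1", "p1", "v1")], ["primaryKeyValue", "partitionKeyValue", "dataValue"]
)

hudi_options = {
    "hoodie.table.name": "acid_verification",
    "hoodie.datasource.write.recordkey.field": "primaryKeyValue",
    "hoodie.datasource.write.partitionpath.field": "partitionKeyValue",
    "hoodie.datasource.write.precombine.field": "dataValue",
    # Optimistic concurrency control so concurrent writers coordinate via a lock.
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    # File-system based lock: no external lock service needed for local/HDFS tests.
    "hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider",
}

df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/lakehouse/acid_verification")
```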
[I] [SUPPORT] Hudi fails ACID verification test [hudi]
matthijseikelenboom opened a new issue, #11170: URL: https://github.com/apache/hudi/issues/11170 **Describe the problem you faced** For work we needed concurrent read/write support for our data lake, which uses Spark. We were noticing some inconsistencies, so we wrote a test that can verify whether something like Hudi adheres to ACID. We did, however, find that Hudi fails this test. Now, it can be that we've wrongly configured Hudi or that there is some mistake in the test code. My question is whether someone of you can take a look at it and perhaps explain what is going wrong here. **To Reproduce** How to run the test and its findings are described in the README of the repository, but here is a short rundown. Steps to reproduce the behavior: 1. Check out repo: [hudi-acid-verification](https://github.com/matthijseikelenboom/hudi-acid-verification) 2. Start Docker if not already running 3. Run the test [TransactionManagerTest.java](https://github.com/matthijseikelenboom/hudi-acid-verification/blob/master/src/test/java/org/example/writer/TransactionManagerTest.java) 4. Observe that the writers break down and that very few transactions have been processed. **Expected behavior** 1. I expect the writers not to break down 2. I expect the full number of transactions to be executed **Environment Description** * Hudi version : 0.14.1 * Spark version : 3.4.2 * Hive version : 4.0.0-beta-1 * Hadoop version : 3.2.2 * Storage (HDFS/S3/GCS..) : NTFS (Windows), APFS (macOS) & HDFS * Running on Docker? (yes/no) : No **Additional context** It's worth noting that other solutions, Iceberg and Delta Lake, have also been tested this way. Iceberg also didn't pass this test. Delta Lake did pass the test. **Stacktrace** ``` 24/05/07 21:49:38 ERROR TransactionWriter: Exception in writer.
org.example.writer.TransactionFailedException: org.apache.hudi.exception.HoodieRollbackException: Failed to rollback file:/tmp/lakehouse/concurrencytestdb.db/acid_verification commits 20240507214932607 at org.example.writer.TransactionWriter.wrapOrRethrowException(TransactionWriter.java:190) at org.example.writer.TransactionWriter.tryTransaction(TransactionWriter.java:184) at org.example.writer.TransactionWriter.updateTransaction(TransactionWriter.java:143) at org.example.writer.TransactionWriter.lambda$handleTransaction$0(TransactionWriter.java:89) at org.example.writer.TransactionWriter.withRetryOnException(TransactionWriter.java:109) at org.example.writer.TransactionWriter.handleTransaction(TransactionWriter.java:83) at org.example.writer.TransactionWriter.run(TransactionWriter.java:70) Caused by: org.apache.hudi.exception.HoodieRollbackException: Failed to rollback file:/tmp/lakehouse/concurrencytestdb.db/acid_verification commits 20240507214932607 at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:1065) at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:1012) at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:940) at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:922) at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:917) at org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:941) at org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:222) at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:940) at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:933) at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:501) at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:204) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:121) at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.executeUpsert(MergeIntoHoodieTableCommand.scala:439) at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:282) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) at
Re: [I] [SUPPORT] Hudi MOR high latency on data availability [hudi]
sgcisco commented on issue #8: URL: https://github.com/apache/hudi/issues/8#issuecomment-2089073931 @ad1happy2go the record key looks like `record_keys=["timestamp", "A", "B", "C"],` where `timestamp` is monotonically increasing in ms, `A` is a string with a range of some 500k values, `B` is similar to `A`, and `C` has at most a hundred values. We use `upsert`, which is the default operation, but we don't expect any updates to the inserted values. We tried `insert` but the observed latencies were worse. Increasing partitioning granularity from daily to hourly seems to help decrease latencies but not to solve the problem completely. ![Screenshot 2024-05-01 at 22 07 16](https://github.com/apache/hudi/assets/168409126/7cd6bd72-2ecb-4826-99f6-567481b234bc) In this case the partition size goes down from 100 GB to 4.7 GB. > Are you seeing disk spill during this operation? You can try increasing the executor memory to avoid it. No, over a 15h running job ![Screenshot 2024-05-01 at 22 19 07](https://github.com/apache/hudi/assets/168409126/c860312d-ed03-427d-aaa8-ca9bedcb0ed5) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi MOR high latency on data availability [hudi]
ad1happy2go commented on issue #8: URL: https://github.com/apache/hudi/issues/8#issuecomment-2088378573 @sgcisco What is the nature of your record key? Is it a random id? Building the workload profile does the index lookup, which is basically a join between the existing data and the incremental data to identify which records need to be updated or inserted. Are you seeing disk spill during this operation? You can try increasing the executor memory to avoid it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi MOR high latency on data availability [hudi]
sgcisco commented on issue #8: URL: https://github.com/apache/hudi/issues/8#issuecomment-2086534021 @ad1happy2go thanks for your reply. We tried `compact num.delta commits as 1` in one of the tests; for the other runs, and in what we are trying to use now, it is the default value, which is 5. As another test attempt we tried to run a pipeline over several days but with a lower ingestion rate of 600 KB/s and the same Hudi and Spark configuration as above. The most time-consuming stage is `Building workload profile`, which takes 2.5 - 12 min, with an average of around 7 min. ![Screenshot 2024-04-30 at 19 44 00](https://github.com/apache/hudi/assets/168409126/ceb6353a-b90f-4abd-8111-5477338701d5) ![Screenshot 2024-04-30 at 20 37 15](https://github.com/apache/hudi/assets/168409126/03b7fe99-7eba-4a24-b4b6-446a6b527c67) So in this case it is around 35-40 MB per minute in the current Structured Streaming minibatch, and workers can go up to 35 GB and 32 cores. Does that look like a sufficient resource config for Hudi to handle such a load? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi MOR high latency on data availability [hudi]
ad1happy2go commented on issue #8: URL: https://github.com/apache/hudi/issues/8#issuecomment-2085803990 Thanks for raising this @sgcisco. I noticed you are using compaction num.delta commits as 1. Any reason for that? If we need to compact after every commit, then it is better to use a COW table itself. One other reason may be that the ingestion job is starved of resources because the async compaction job is consuming them. Did we analyse the Spark UI? Which stage has started taking more time? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [SUPPORT] Hudi MOR high latency on data availability [hudi]
sgcisco opened a new issue, #8: URL: https://github.com/apache/hudi/issues/8 **Describe the problem you faced** Running a streaming solution with Kafka - Structured Streaming (PySpark) - Hudi (MOR tables) + AWS Glue + S3, we observed periodically growing latencies in data availability in Hudi. Latencies were measured as the difference between the data generation `timestamp` and `_hudi_commit_timestamp` and could go up to 30 min. Periodic manual checks of the latest available data points' `timestamps`, by running queries as described here https://hudi.apache.org/docs/0.13.1/querying_data#spark-snap-query, confirmed such delays. ![image](https://github.com/apache/hudi/assets/168409126/5f7e6e1c-565b-47c1-b293-898cf2d8c40b) ![image](https://github.com/apache/hudi/assets/168409126/9bb379fa-85bc-467d-853f-8dc9651803b3) When using Spark with Hudi, the data read-out rate from Kafka was unstable ![Screenshot 2024-04-29 at 11 49 29](https://github.com/apache/hudi/assets/168409126/1f114523-a574-4d39-90e8-a6d674f79aa0) To exclude impact from any components other than Hudi, we ran some experiments with the same configuration and ingestion settings but without Hudi, with a direct write to S3. They did not reveal any delays above 2 min, where a 1 min delay is always present due to the Structured Streaming minibatch granularity. In this case the Kafka read-out rate was stable over time. **Additional context** We tried to optimize Hudi file sizing and the MOR layout by applying suggestions from these references https://github.com/apache/hudi/issues/2151#issuecomment-706400445, https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoItoavoidcreatingtonsofsmallfiles, https://github.com/apache/hudi/issues/2151#issuecomment-706400445 We could get a target file size between 90-120 MB by lowering `hoodie.copyonwrite.record.size.estimate` from 1024 to 100 and using `Inline.compact=false and delta.commits=1 and async.compact=true and hoodie.merge.small.file.group.candidates.limit=20`, but it did not have any impact on latency. Another commit strategy, `NUM_OR_TIME`, as suggested here https://github.com/apache/hudi/issues/8975#issuecomment-1593408753 with the parameters below, did not help resolve the problem ``` "hoodie.copyonwrite.record.size.estimate": "100", "hoodie.compact.inline.trigger.strategy": "NUM_OR_TIME", "hoodie.metadata.compact.max.delta.commits": "5", "hoodie.compact.inline.max.delta.seconds": "60", ``` As a trade-off we came up with the configuration below, which allows us to have relatively low latencies for the 90th percentile and a file size of 40-90 MB ``` "hoodie.merge.small.file.group.candidates.limit": "40", "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS", ``` ![10_31_12](https://github.com/apache/hudi/assets/168409126/bf85386b-7f6e-48a9-b855-ff8cb391080d) But still some records could go up to 30 min. ![02_42_29](https://github.com/apache/hudi/assets/168409126/9e6442c1-8cfa-4778-abc5-5d1050cb3653) However, the last config works relatively well for low ingestion rates up to 1.5 MB/s with daily partitioning `partition_date=-MM-dd/` but stops working for rates above 2.5 MB/s even with more granular partitioning `partition_date=-MM-dd-HH/` **Expected behavior** Since we use MOR tables: - low latencies on data availability - proper file sizing defined by the limits ``` "hoodie.parquet.small.file.limit" : "104857600", "hoodie.parquet.max.file.size" : "125829120", ``` **Environment Description** * Hudi version : 0.13.1 * Spark version : 3.4.1 * Hive version : 3.1 * Hadoop version : EMR 6.13 * Storage (HDFS/S3/GCS..)
: S3 * Running on Docker? (yes/no) : No Hudi configuration ``` "hoodie.datasource.hive_sync.auto_create_database": "true", "hoodie.datasource.hive_sync.enable": "true", "hoodie.datasource.hive_sync.mode": "hms", "hoodie.datasource.hive_sync.table": table_name, "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor", "hoodie.datasource.hive_sync.use_jdbc": "false", "hoodie.datasource.hive_sync.database": _glue_db_name, "hoodie.datasource.write.hive_style_partitioning": "true", "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator", "hoodie.datasource.write.operation": "upsert", "hoodie.datasource.write.schema.allow.auto.evolution.column.drop": "true", "hoodie.datasource.write.table.name": table_name, "hoodie.datasource.write.table.type": "MERGE_ON_READ",
Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]
codope commented on issue #8114: URL: https://github.com/apache/hudi/issues/8114#issuecomment-2081997720 Yes this was fixed in 0.13.0 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]
codope closed issue #8114: [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation URL: https://github.com/apache/hudi/issues/8114 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]
danny0405 commented on issue #8114: URL: https://github.com/apache/hudi/issues/8114#issuecomment-2081940660 cc @codope guess this should have been fixed? https://github.com/apache/hudi/pull/6662 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]
zhaobangcai commented on issue #8114: URL: https://github.com/apache/hudi/issues/8114#issuecomment-2081816664 Has this problem been solved? @Limess -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi partitions not dropped by Hive sync after `insert_overwrite_table` operation [hudi]
zhaobangcai commented on issue #8114: URL: https://github.com/apache/hudi/issues/8114#issuecomment-2081815485 Is there no further update on this question? Do you have any plans to fix it, and in which version? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [SUPPORT] [hudi]
jack1234smith opened a new issue, #11023: URL: https://github.com/apache/hudi/issues/11023 **Describe the problem you faced** Error exception: java.util.NoSuchElementException: No value present in Option at org.apache.hudi.common.util.Option.get(Option.java:89) at org.apache.hudi.table.format.mor.MergeOnReadInputFormat.initIterator(MergeOnReadInputFormat.java:204) at org.apache.hudi.table.format.mor.MergeOnReadInputFormat.open(MergeOnReadInputFormat.java:189) at org.apache.hudi.source.StreamReadOperator.processSplits(StreamReadOperator.java:169) at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50) at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMail(MailboxProcessor.java:398) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:367) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:352) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:229) at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788) at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952) at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) at java.lang.Thread.run(Thread.java:745) data error: ![17131698404766](https://github.com/apache/hudi/assets/50668893/9fb26f15-6228-486d-a9c5-2b70c746f784) **To Reproduce** Steps to reproduce the behavior: 1. kill yarn session 2. Restart job from checkpoint **Environment Description** * Hudi version : 0.14.1 * Hive version : 3.1.3 * Hadoop version : 3.3.6 * Storage (HDFS/S3/GCS..) : HDFS * Running on Yarn Session? 
(yes/no) : yes **Additional context** My table are: CREATE TABLE if not exists ods_table( id int, count_num double, write_time timestamp(0), _part string, proc_time timestamp(3), WATERMARK FOR write_time AS write_time ) PARTITIONED BY (_part) WITH ( 'connector'='hudi', 'path' ='hdfs://masters/test/ods_table', 'table.type'='MERGE_ON_READ', 'hoodie.datasource.write.recordkey.field' = 'id', 'hoodie.datasource.write.precombine.field' = 'write_time', 'write.bucket_assign.tasks'='1', 'write.tasks' = '1', 'compaction.tasks' = '1', 'compaction.async.enabled' = 'true', 'compaction.schedule.enabled' = 'true', 'compaction.trigger.strategy' = 'time_elapsed', 'compaction.delta_seconds' = '600', 'compaction.delta_commits' = '1', 'read.streaming.enabled' = 'true', 'read.streaming.skip_compaction' = 'true', 'read.start-commit' = 'earliest', 'changelog.enabled' = 'true', 'hive_sync.enable'='true', 'hive_sync.mode' = 'hms', 'hive_sync.metastore.uris' = 'thrift://h35:9083', 'hive_sync.db'='test', 'hive_sync.table'='hive_ods_table' ); CREATE TABLE if not exists ads_table( sta_date string, num double, proc_time as proctime() ) WITH ( 'connector'='hudi', 'path' ='hdfs://masters/test/ads_table', 'table.type'='COPY_ON_WRITE', 'hoodie.datasource.write.recordkey.field' = 'sta_date', 'write.bucket_assign.tasks'='1', 'write.tasks' = '1', 'compaction.tasks' = '1', 'compaction.async.enabled' = 'true', 'compaction.schedule.enabled' = 'true', 'compaction.trigger.strategy' = 'time_elapsed', 'compaction.delta_seconds' = '600', 'compaction.delta_commits' = '1', 'read.streaming.enabled' = 'true', 'read.streaming.skip_compaction' = 'true', 'read.start-commit' = 'earliest', 'changelog.enabled' = 'true', 'hive_sync.enable'='true', 'hive_sync.mode' = 'hms', 'hive_sync.metastore.uris' = 'thrift://h35:9083', 'hive_sync.db'='test', 'hive_sync.table'='hive_ads_table' ); My job is: insert into test.ads_table select _part, sum(count_num) from test.ods_table group by _part; -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail:
[I] [SUPPORT] [hudi]
VitoMakarevich opened a new issue, #10997: URL: https://github.com/apache/hudi/issues/10997 **Describe the problem you faced** We are using Spark 3.3 and Hudi 0.12.2. I need your assistance in improving the `Doing partition and writing data` stage. For us, it looks to be the most time-consuming one. We are using `snappy` compression (the most lightweight of the available codecs, as far as I know), the file size is ~160 MB, which is effectively 80-90 GB GZIP (the default codec in Hudi for our workload). The files themselves consist of 1.5-2M rows. Our problem is that, unfortunately, due to partitioning + the CDC nature, we must update a lot of files at peak hours; we have clustering to group rows together, but thousands of files are still affected. The 75th percentile of an individual file overwrite (a task in the `Doing partition and writing data` stage) takes ~40-60 seconds, and it does not correlate with the number of rows updated inside (for the 75th percentile it's < 100 rows changed in every file). Also, the payload class is almost default (minor changes which do not affect performance, IMO). Q: 1. What are the knobs we can play with? We tried the compression format (`snappy` looks to be the best among `zstd`, which has a memory leak in Spark 3.3 BTW, and `gzip`). We also tried `hoodie.write.buffer.limit.bytes`, raising it to 32 MB; unfortunately, no visible difference. Are there any others? 2. Do you know of any performance improvements in newer versions (0.12.3-0.14.1) specifically regarding the file write (`MergeHandle`) task? **Environment Description** * Hudi version : 0.12.2 * Spark version : 3.3.0 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
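For reference, a hedged sketch of the file-sizing and compression knobs this report touches on, expressed as Spark datasource options in PySpark. The table name, key fields, sample data, and path are placeholders, and the byte values simply echo the sizes mentioned in the report rather than recommended settings.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-sizing-sketch").getOrCreate()

# Placeholder CDC-like frame; only the column names matter for the sketch.
df = spark.createDataFrame([("1", "2024-05-01 00:00:00", "v")], ["id", "ts", "payload"])

hudi_options = {
    "hoodie.table.name": "cdc_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    # Lighter-weight compression, as the reporter settled on.
    "hoodie.parquet.compression.codec": "snappy",
    # Roughly the ~160 MB base-file target described in the report.
    "hoodie.parquet.max.file.size": str(160 * 1024 * 1024),
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
    # Merge buffer knob the reporter already experimented with (32 MB).
    "hoodie.write.buffer.limit.bytes": str(32 * 1024 * 1024),
}

df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi/cdc_table")
```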
Re: [I] [SUPPORT] Hudi CLI bundle not working [hudi]
mansipp commented on issue #10566: URL: https://github.com/apache/hudi/issues/10566#issuecomment-2043833097 Getting the similar error while running the `commit rollback` ``` commit rollback --commit 20240408231846380 24/04/08 23:22:02 INFO InputStreamConsumer: Apr 08, 2024 11:22:02 PM org.apache.spark.launcher.Log4jHotPatchOption staticJavaAgentOption 24/04/08 23:22:02 INFO InputStreamConsumer: WARNING: spark.log4jHotPatch.enabled is set to true, but /usr/share/log4j-cve-2021-44228-hotpatch/jdk17/Log4jHotPatchFat.jar does not exist at the configured location 24/04/08 23:22:02 INFO InputStreamConsumer: 24/04/08 23:22:03 INFO InputStreamConsumer: Error: Failed to load org.apache.hudi.cli.commands.SparkMain: org/apache/hudi/common/engine/HoodieEngineContext 24/04/08 23:22:03 INFO InputStreamConsumer: 24/04/08 23:22:03 INFO ShutdownHookManager: Shutdown hook called 24/04/08 23:22:03 INFO InputStreamConsumer: 24/04/08 23:22:03 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-272bb6ef-f858-42a6-b9d0-9614f1f36371 24/04/08 23:22:03 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://mansipp-emr-dev/hudi_cli_migration/tables/mor/mansipp_hudi_mor_table_2/ 24/04/08 23:22:03 INFO HoodieTableConfig: Loading table properties from s3://mansipp-emr-dev/hudi_cli_migration/tables/mor/mansipp_hudi_mor_table_2/.hoodie/hoodie.properties 24/04/08 23:22:03 INFO S3NativeFileSystem: Opening 's3://mansipp-emr-dev/hudi_cli_migration/tables/mor/mansipp_hudi_mor_table_2/.hoodie/hoodie.properties' for reading 24/04/08 23:22:03 INFO HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET) from s3://mansipp-emr-dev/hudi_cli_migration/tables/mor/mansipp_hudi_mor_table_2/ Commit 20240408231846380 failed to roll back``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [SUPPORT] [hudi]
MrAladdin opened a new issue, #10972: URL: https://github.com/apache/hudi/issues/10972 **Describe the problem you faced** Spark Structured Streaming upserts into Hudi (MOR, RECORD_INDEX) are very time consuming: 1. The number of tasks in each distinct stage of building the workload profile is always 60, and there is severe data skew. I want to know why it is always 60, how to adjust it, the reasons for the data skew, and possible optimizations. I have done my best. **Environment Description** * Hudi version : 0.14.1 * Spark version : 3.4.1 * Hive version : 3.1.2 * Hadoop version : 3.1.3 * Storage (HDFS/S3/GCS..) : HDFS * Running on Docker? (yes/no) : no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
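Not an answer from the thread, just a hedged sketch of where the parallelism and index options live for a structured-streaming upsert into a MOR table; the source, column names, paths, and values are placeholders. The fixed task count of a stage often mirrors the partitioning of the incoming micro-batch or settings such as `hoodie.upsert.shuffle.parallelism`.
```
from pyspark.sql import SparkSession

# Hedged sketch, not from the issue: parallelism-related knobs commonly checked when the
# upsert/workload-profile stages show a fixed task count and skew. Values are placeholders.
spark = SparkSession.builder.appName("hudi-streaming-upsert-sketch").getOrCreate()

# Stand-in streaming source; the real job would read from Kafka or files.
stream_df = spark.readStream.format("rate").load().withColumnRenamed("value", "id")

hudi_options = {
    "hoodie.table.name": "my_table",                              # assumed table name
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    "hoodie.metadata.record.index.enable": "true",                # the RECORD_INDEX mentioned above
    "hoodie.upsert.shuffle.parallelism": "200",                   # placeholder; a fixed task count may come from a setting like this
}

(stream_df.writeStream.format("hudi")
    .options(**hudi_options)
    .option("checkpointLocation", "/tmp/hudi/checkpoints/my_table")  # assumed path
    .outputMode("append")
    .start("/tmp/hudi/my_table"))                                    # assumed path
```
Skew itself usually traces back to the record-key and partition-path distribution of the incoming batch rather than to any single Hudi option.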
Re: [I] [SUPPORT] Hudi deltastreamer fails due to Clean [hudi]
codope closed issue #7209: [SUPPORT] Hudi deltastreamer fails due to Clean URL: https://github.com/apache/hudi/issues/7209 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi deltastreamer fails due to Clean [hudi]
ad1happy2go commented on issue #7209: URL: https://github.com/apache/hudi/issues/7209#issuecomment-2029660841 @koldic Sorry we missed it. You can use multi-writer concurrency control to handle that: https://hudi.apache.org/docs/concurrency_control/#enabling-multi-writing Closing this issue as it was caused by multiple writers. Thanks. Feel free to open a new one in case of any new issues. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
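For anyone following the link above, this is a minimal sketch of what enabling optimistic concurrency control typically looks like, based on the documented options; the ZooKeeper lock provider and its endpoints are assumptions, and other lock providers (DynamoDB, Hive metastore, file system) also exist.
```
from pyspark.sql import SparkSession

# Hedged sketch of the documented multi-writer (OCC) settings; lock-provider choice and
# ZooKeeper endpoints are assumptions for illustration, not this issue's actual setup.
spark = SparkSession.builder.appName("hudi-occ-sketch").getOrCreate()
df = spark.createDataFrame([(1, "a", "2024-01-01")], ["id", "value", "dt"])  # stand-in batch

occ_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",  # so concurrent writers are not rolled back eagerly
    "hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk-host",        # assumed endpoint
    "hoodie.write.lock.zookeeper.port": "2181",
    "hoodie.write.lock.zookeeper.lock_key": "my_table",
    "hoodie.write.lock.zookeeper.base_path": "/hudi/locks",
}

df.write.format("hudi").options(**occ_options).mode("append").save("/tmp/hudi/my_table")  # assumed path
```
Each concurrent writer (for example, the ingestion job and a separate table-service or cleaning job) would carry the same lock configuration.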
Re: [I] [SUPPORT] Hudi cdc upserts stopped working after migrating from hudi 13.1 to 14.0 [hudi]
ROOBALJINDAL closed issue #10884: [SUPPORT] Hudi cdc upserts stopped working after migrating from hudi 13.1 to 14.0 URL: https://github.com/apache/hudi/issues/10884 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi cdc upserts stopped working after migrating from hudi 13.1 to 14.0 [hudi]
ROOBALJINDAL commented on issue #10884: URL: https://github.com/apache/hudi/issues/10884#issuecomment-2029401141 I have found the issue. We were using a custom MssqlDebeziumSource class as the Debezium source, and in its constructor we were using `HoodieStreamerMetrics` instead of `HoodieIngestionMetrics` (which was introduced in Hudi 0.14.0). Once we corrected the class, it started working. We can close this issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi 0.14.0 - deletion from table failing for org.apache.hudi.keygen.TimestampBasedKeyGenerator [hudi]
xicm commented on issue #10823: URL: https://github.com/apache/hudi/issues/10823#issuecomment-2024834691 I think your timestamp.type should be "DATE_STRING". -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
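To make the suggestion concrete, here is a hedged sketch of a DATE_STRING setup for `org.apache.hudi.keygen.TimestampBasedKeyGenerator`; the field names, date formats, and paths are placeholders rather than the reporter's actual configuration.
```
from pyspark.sql import SparkSession

# Hedged sketch of a TimestampBasedKeyGenerator configured with DATE_STRING, as suggested above.
# Field names, formats, and paths are illustrative placeholders, not the reporter's config.
spark = SparkSession.builder.appName("hudi-keygen-sketch").getOrCreate()
df = spark.createDataFrame([(1, "a", "2024-01-01")], ["id", "value", "created_at"])

keygen_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "created_at",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.TimestampBasedKeyGenerator",
    "hoodie.keygen.timebased.timestamp.type": "DATE_STRING",
    "hoodie.keygen.timebased.input.dateformat": "yyyy-MM-dd",    # format the source column actually uses
    "hoodie.keygen.timebased.output.dateformat": "yyyy/MM/dd",   # format written into the partition path
    "hoodie.keygen.timebased.timezone": "GMT",
}

df.write.format("hudi").options(**keygen_options).mode("append").save("/tmp/hudi/my_table")
```
The important point from the comment above is that the `timestamp.type` must match how the partition column is actually stored; deletes then resolve to the same partition path as the original upserts.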
Re: [I] [SUPPORT] Hudi cdc upserts stopped working after migrating from hudi 13.1 to 14.0 [hudi]
ad1happy2go commented on issue #10884: URL: https://github.com/apache/hudi/issues/10884#issuecomment-2007239157 I don't think it is a Kafka-version-related issue, as the job is not failing. We need more logs to debug this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi cdc upserts stopped working after migrating from hudi 13.1 to 14.0 [hudi]
ROOBALJINDAL commented on issue #10884: URL: https://github.com/apache/hudi/issues/10884#issuecomment-2006856626 @ad1happy2go I need time to set up a new cluster. Our AWS MSK Kafka cluster uses Kafka version 2.6.2; can you confirm whether this is fine or whether it could be an issue? Is there any specific supported version of Kafka? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi cdc upserts stopped working after migrating from hudi 13.1 to 14.0 [hudi]
ad1happy2go commented on issue #10884: URL: https://github.com/apache/hudi/issues/10884#issuecomment-2006696281 @ROOBALJINDAL Is it possible to try the same on EMR so that you get all the logs to look into this further? There are no known changes in 0.14.0 that would cause this after the upgrade. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi cdc upserts stopped working after migrating from hudi 13.1 to 14.0 [hudi]
ROOBALJINDAL commented on issue #10884: URL: https://github.com/apache/hudi/issues/10884#issuecomment-2006449206 @nsivabalan can you please check? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [SUPPORT] Hudi cdc upserts stopped working after migrating from hudi 13.1 to 14.0 [hudi]
ROOBALJINDAL opened a new issue, #10884: URL: https://github.com/apache/hudi/issues/10884 Issue: We have migrated from Hudi 0.13.0 to Hudi 0.14.0, and in this version CDC events from Kafka upserts are not working. The table is created the first time, but afterwards any new record added/updated in the SQL table, which pushes a CDC event to Kafka, does not get updated in the Hudi table. Is there any new configuration for Hudi 0.14.0? We are running AWS EMR Serverless 6.15. We tried to enable debug-level logs by providing the following classification to the serverless app, which modifies the log4j properties to print Hudi package logs, but these logs still do not appear.
```
{
  "classification": "spark-driver-log4j2",
  "properties": {
    "rootLogger.level": "debug",
    "logger.hudi.level": "debug",
    "logger.hudi.name": "org.apache.hudi"
  }
},
{
  "classification": "spark-executor-log4j2",
  "properties": {
    "rootLogger.level": "debug",
    "logger.hudi.level": "debug",
    "logger.hudi.name": "org.apache.hudi"
  }
}
```
Since it is serverless, we can't SSH into a node to see the log4j property file, so we couldn't get the Hudi logs.

### **Configurations:**

**### Spark job parameters:**
```
--class org.apache.hudi.utilities.streamer.HoodieMultiTableStreamer
--conf spark.sql.avro.datetimeRebaseModeInWrite=CORRECTED
--conf spark.sql.avro.datetimeRebaseModeInRead=CORRECTED
--conf spark.executor.instances=1
--conf spark.executor.memory=4g
--conf spark.driver.memory=4g
--conf spark.driver.cores=4
--conf spark.dynamicAllocation.initialExecutors=1
--props kafka-source.properties
--config-folder table-config
--payload-class com.myorg.MssqlDebeziumAvroPayload
--source-class com.myorg.MssqlDebeziumSource
--source-ordering-field _event_lsn
--enable-sync
--table-type COPY_ON_WRITE
--source-limit 10
--op UPSERT
```
**### kafka-source.properties:**
```
hoodie.streamer.ingestion.tablesToBeIngested=database1.student
auto.offset.reset=earliest
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.streamer.source.kafka.value.deserializer.class=io.confluent.kafka.serializers.KafkaAvroDeserializer
hoodie.streamer.schemaprovider.registry.url=
schema.registry.url=http://schema-registry-x:8080/apis/ccompat/v6
bootstrap.servers=b-1..ikwdtc.c13.us-west-2.amazonaws.com:9096
hoodie.streamer.schemaprovider.registry.baseUrl=http://schema-registry-x:8080/apis/ccompat/v6/subjects/
hoodie.parquet.max.file.size=2147483648
hoodie.parquet.small.file.limit=1073741824
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="" password="x";
ssl.truststore.location=/usr/lib/jvm/java/jre/lib/security/cacerts
ssl.truststore.password=changeit
```
**### Table config properties:**
```
hoodie.datasource.hive_sync.database=database1
hoodie.datasource.hive_sync.support_timestamp=true
hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled=true
hoodie.datasource.write.recordkey.field=studentsid
hoodie.datasource.write.partitionpath.field=studentcreationdate
hoodie.datasource.hive_sync.table=student
hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true
hoodie.datasource.hive_sync.partition_fields=studentcreationdate
hoodie.keygen.timebased.timestamp.type=SCALAR
hoodie.keygen.timebased.timestamp.scalar.time.unit=DAYS
hoodie.keygen.timebased.input.dateformat=-MM-dd
hoodie.keygen.timebased.output.dateformat=-MM-01
hoodie.keygen.timebased.timezone=GMT+8:00
hoodie.datasource.write.hive_style_partitioning=true
hoodie.datasource.hive_sync.mode=hms
hoodie.streamer.source.kafka.topic=dev.student
hoodie.streamer.schemaprovider.registry.urlSuffix=-value/versions/latest
```
**Environment Description** * Hudi version : 0.14.0 * Spark version : 3.4.1 * Hive version : 3.1.3 * Hadoop version : 3.3.6 * Storage (HDFS/S3/GCS..) : S3 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi offline compaction ignores old data [hudi]
danny0405 commented on issue #10863: URL: https://github.com/apache/hudi/issues/10863#issuecomment-1996493691 So you are using offline compaction because online async compaction is disabled: `'compaction.async.enabled' = 'false'`. Did you check the compaction plan to see whether the files it includes are the ones you expect? BTW, you schedule a compaction for every commit, so why not use a COW table instead? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Hudi offline compaction ignores old data [hudi]
ennox108 commented on issue #10863: URL: https://github.com/apache/hudi/issues/10863#issuecomment-1996335338 It's streaming ingestion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org