Hi Syed, as Vinoth mentioned, the HoodieSnapshotCopier is meant for this purpose.
You may also read up on RFC-9, which plans to introduce a backward-compatible tool covering HoodieSnapshotCopier:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
Unfortunately I'm not actively working on this. If you're interested, feel free to pick it up. I'd be happy to help with that. (Rough sketches of running the copier and of the bulk-load-then-stream flow are appended below the quoted thread.)

On Wed, Feb 12, 2020 at 7:25 PM Vinoth Chandar <[email protected]> wrote:

> Hi Syed,
>
> Apologies for the delay. If you are using copy-on-write, you can look into
> savepoints (although I realize it's only exposed at the RDD API level). We
> do have a tool called HoodieSnapshotCopier in hudi-utilities, to take
> periodic copies/snapshots of a table for backup purposes, as of a given
> commit. Raymond (if you are here) has an RFC to enhance that even further.
> Running the copier (please test it first, since it's not used in OSS that
> much IIUC) periodically, say every day, would achieve your goals I believe.
>
> https://github.com/apache/incubator-hudi/blob/c2c0f6b13d5b72b3098ed1b343b0a89679f854b3/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotCopier.java
>
> Any issues in the tool should be simple to fix. The tool itself is only a
> couple hundred lines.
>
> Thanks
> Vinoth
>
> On Mon, Feb 10, 2020 at 3:56 AM Syed Abdul Kather <[email protected]> wrote:
>
> > Yes. Also for restoring the data from cold storage.
> >
> > Our use case: we stream data using Debezium and push it to Kafka, with a
> > 7-day retention in Kafka. If the destination Hudi table gets corrupted,
> > or we need to repopulate it, we need a way to restore the data.
> >
> > Thanks and Regards,
> > S SYED ABDUL KATHER
> > *Data platform Lead @ Tathastu.ai*
> >
> > *+91 - 7411011661*
> >
> > On Mon, Jan 13, 2020 at 10:17 PM Vinoth Chandar <[email protected]> wrote:
> >
> > > Hi Syed,
> > >
> > > If I follow correctly, are you asking how to do a bulk load first and
> > > then use the DeltaStreamer on top of that dataset to apply binlogs from
> > > Kafka?
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Mon, Jan 13, 2020 at 12:39 AM Syed Abdul Kather <[email protected]> wrote:
> > >
> > > > Hi Team,
> > > >
> > > > We have onboarded a few tables with a really huge number of records
> > > > (100M records). The plan is to enable the binlog for the database;
> > > > that is no issue, as the stream can handle the load. But for loading
> > > > the snapshot, we have used Sqoop to import the whole table to S3.
> > > >
> > > > What we need here: can we load the whole Sqooped dump into a Hudi
> > > > table, and then use the stream (binlog data coming via Kafka) on top
> > > > of it?
> > > >
> > > > Thanks and Regards,
> > > > S SYED ABDUL KATHER
> > > > *Bigdata [email protected]*
> > > > * +91-7411011661*
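
For anyone picking this up: here is a minimal sketch of wiring the copier into a small
scheduled Spark job, along the lines Vinoth suggests above. The flag names (--base-path,
--output-path) are recalled from the class's CLI config and the S3 paths are hypothetical
placeholders, so please verify both against the linked HoodieSnapshotCopier source before
relying on this.

  // Rough sketch only, not a tested job: wraps the copier's CLI entry point so it
  // can be scheduled (e.g. nightly) via spark-submit or cron. Flag names and S3
  // paths are assumptions -- verify against HoodieSnapshotCopier itself.
  object NightlySnapshotCopy {
    def main(args: Array[String]): Unit = {
      val today = java.time.LocalDate.now.toString  // e.g. "2020-02-13", keeps daily copies apart
      org.apache.hudi.utilities.HoodieSnapshotCopier.main(Array(
        "--base-path", "s3://prod-bucket/hudi/my_table",                // hypothetical source Hudi table
        "--output-path", s"s3://backup-bucket/hudi/my_table/$today"     // hypothetical dated backup location
      ))
    }
  }

The idea is that each daily run copies the table's files as of the latest completed commit
into a dated backup location, which can later be copied back for the restore scenario Syed
describes.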

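On the original bootstrap question (Sqoop dump first, then binlogs from Kafka), a common
pattern is a one-time bulk_insert of the dump through the Spark datasource, after which the
Kafka changelog is applied as upserts to the same base path. A minimal sketch, assuming the
Sqoop export is Parquet on S3 and that the paths, table name, and "id" / "created_date" /
"updated_at" column names are hypothetical stand-ins for the real ones:

  import org.apache.spark.sql.{SaveMode, SparkSession}

  object BulkLoadSqoopDump {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("hudi-bulk-load").getOrCreate()

      // Read the one-time full dump that Sqoop wrote to S3
      val dump = spark.read.parquet("s3://my-bucket/sqoop-export/my_table/")

      // One-time bulk_insert into the Hudi base path; later binlog batches from
      // Kafka are applied to the same path as upserts (e.g. via the DeltaStreamer)
      dump.write.format("org.apache.hudi")
        .option("hoodie.table.name", "my_table")
        .option("hoodie.datasource.write.operation", "bulk_insert")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.partitionpath.field", "created_date")
        .option("hoodie.datasource.write.precombine.field", "updated_at")
        .mode(SaveMode.Overwrite)
        .save("s3://my-bucket/hudi/my_table")

      spark.stop()
    }
  }

Once that first commit completes, the DeltaStreamer (or datasource upserts) pointed at the
same base path can consume the Debezium changelog from Kafka and apply it incrementally on
top of the bulk-inserted data.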