Re: Question

2020-03-18 Thread Vinoth Chandar
Hi Syed, Please join the mailing list, so your responses make it here without needing approval. I am sure there is something odd going on here. A few things to check - Hudi does use memory for caching inputs and computing heuristics. I have seen slowness being caused by insufficient executor memory

Re: Question

2020-03-18 Thread Syed Zaidi
Hi Udit, Thanks for your recommendation. I was able to get the jars for 0.5.1. As a test we ran Hudi against a small dataset (~2 million rows with 80 columns) in a Parquet file on 10 executors (m5.xlarge). The initial load itself is taking 2+ hours. Do you have any suggestions on the settin

Re: [NOTIFICATION] Hudi 0.5.2 Release Daily Report-20200318

2020-03-18 Thread vino yang
Hi Vinoth, >>We may need to revert HUDI-676[3]: Address issues towards removing use of WIP Disclaimer >>I think we should address the feedback and ensure VOTE passed with "non >>WIP" disclaimer.. the WIP disclaimer cannot be retained forever and needs >>to be fixed before graduation IIUC. So you

Re: Query regarding restoring HUDI tables to older commits

2020-03-18 Thread Balajee Nagasubramaniam
Hi Prashant, Regarding clean vs rollback/restoreToInstant, if you think of all the commits/datafiles in the active timeline as a queue of items, rollback/restoreToInstant would be working on the head of the queue whereas clean would be working on the tail of the queue. They should be treated as tw
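The queue analogy in the message above can be sketched in a few lines of Python. This is only a toy model of an ordered timeline, not Hudi's implementation: the function names `clean` and `restore_to_instant`, the `retain` parameter, and the instant labels are all hypothetical, and the real clean operation works on older file versions rather than simply dropping timeline entries. The oldest instant sits on the left of the deque (the end the message calls the tail) and the newest on the right (the end the message calls the head).

```python
from collections import deque


def clean(timeline, retain):
    """Toy clean: drop the oldest instants until only `retain` remain.

    Returns the removed instants, oldest first.
    """
    removed = []
    while len(timeline) > retain:
        removed.append(timeline.popleft())  # oldest end of the queue
    return removed


def restore_to_instant(timeline, instant):
    """Toy restore: drop every instant newer than `instant`.

    Returns the removed instants, newest first. Instants are compared as
    equal-length strings, so lexicographic order matches time order.
    """
    removed = []
    while timeline and timeline[-1] > instant:
        removed.append(timeline.pop())  # newest end of the queue
    return removed


# Hypothetical instants, oldest on the left.
timeline = deque(["001", "002", "003", "004", "005"])
cleaned = clean(timeline, retain=4)
restored = restore_to_instant(timeline, "003")
```

In this model the two operations never touch the same end of the queue, which is the point of the analogy.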

Re: Query regarding restoring HUDI tables to older commits

2020-03-18 Thread vbal...@apache.org
Prashant, My concern was that we should not be losing metadata about the clean operation. But there is a way: as long as we are faithfully copying the clean metadata that tracks the files which got cleaned and storing it in the restore metadata, we should be able to keep the metadata in sync. Balaji.V On

Re: Query regarding restoring HUDI tables to older commits

2020-03-18 Thread Prashant Wason
Thanks for the info Vinoth / Balaji. To me it feels like a split between easier-to-understand design and current-implementation. I feel it is simpler to reason (based on how file systems work in general) that restoreToInstant is a complete point-in-time shift to the past (like restoring a file system f

Re: Query regarding restoring HUDI tables to older commits

2020-03-18 Thread Balaji Varadarajan
Prashant, I think we should not be reverting clean operations here. Cleans are done on the oldest file slices, and a restore/rollback does not completely undo the work of a clean that happened before it. For incremental timeline syncing, the embedded timeline server needs to read these clean metada

Re: Query regarding restoring HUDI tables to older commits

2020-03-18 Thread Vinoth Chandar
Hi Prashant, Not sure if there is a specific reason. Mostly, it is because until recently, the clean metadata was not actually used. Currently, incremental cleaning will use it, but even then, it only relies on the partition paths being touched there.. So should be fine.. +100 though on consistently

Query regarding restoring HUDI tables to older commits

2020-03-18 Thread Prashant Wason
Hi Team, I noticed that when a table is restored to a previous commit (HoodieWriteClient::restoreToInstant), only the COMMIT, DELTA_COMMIT and COMPACTION instants ar

Re: deltastreamer group.id Noeffectaftersetting

2020-03-18 Thread Vinoth Chandar
DeltaStreamer actually just uses the same mechanism as Spark Streaming to manage offsets. So I am wondering whether you see the same behavior with a plain Spark Streaming job? It manages the offset checkpoints manually by itself within the hoodie commit metadata, to do exactly-once ingestion of data.. On

Re: Question on DeltaStreamer

2020-03-18 Thread Vinoth Chandar
>>Let's say I have a source table in Oracle in the format below, will my avro schema for source and target be the same? Yes. If you do any transformations in between, then DeltaStreamer can make the target schema automatically. In the upcoming 0.5.2 release, we also have org.apache.hudi.u

Re: Question on DeltaStreamer

2020-03-18 Thread Shiyan Xu
To answer your question regarding the properties file: it is a way to manage a bunch of hoodie configurations; those confs will be merged with other confs passed from --hoodie-conf. See this line
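The merge described above can be illustrated with a small Python sketch. It is not DeltaStreamer's code: the helper names `parse_properties` and `merge_confs` and the `hoodie.example.*` keys are hypothetical, and the sketch assumes that a value passed on the command line wins when the same key also appears in the file.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def merge_confs(file_props, cli_overrides):
    """Overlay key=value override strings on top of the file properties.

    Assumption: overrides take precedence over values from the file.
    """
    merged = dict(file_props)
    for item in cli_overrides:
        key, _, value = item.partition("=")
        merged[key.strip()] = value.strip()
    return merged


# Hypothetical file contents and command-line overrides.
file_text = "# example properties\nhoodie.example.a = 1\nhoodie.example.b = 2\n"
merged = merge_confs(parse_properties(file_text), ["hoodie.example.b=3"])
```

Here `merged` keeps `hoodie.example.a` from the file and takes `hoodie.example.b` from the override.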

Re: [NOTIFICATION] Hudi 0.5.2 Release Daily Report-20200318

2020-03-18 Thread Vinoth Chandar
Thanks for the update, vino! Here's the -1 vote feedback for everyone's context.. As you bundled several ASF projects that have NOTICE files, their NOTICE > files need to be examined and parts added to your NOTICE file. [1] > License is missing information for this file copyright Twitter [3] > Per

Question on DeltaStreamer

2020-03-18 Thread Syed Zaidi
Hi, I hope things are good. We are planning on using DeltaStreamer as a client for Hudi. Our plan is to use AWS DMS for initial load & CDC. The question I have is around the documentation for the properties file that I need for dfs, source & target. Where can I find more information on the prop

Re: contributor permission

2020-03-18 Thread 965147...@qq.com
this yarn log .auto.commit.interval.ms = 5000 auto.offset.reset = latest bootstrap.servers = [172.16.16.2:9092, 172.16.16.3:9092] check.crcs = true client.dns.lookup = default client.id = connections.max.idle.ms = 54 default.api.timeo

[NOTIFICATION] Hudi 0.5.2 Release Daily Report-20200318

2020-03-18 Thread vino yang
Hi all, We encountered some issues while voting on RC1 on general@[1], so we canceled the vote for RC1. The blocker issues we are currently addressing are: * HUDI-720: NOTICE file needs to add more content based on the NOTICE files of the ASF projects that Hudi bundles (I have opened a PR[2] to fix i

deltastreamer group.id Noeffectaftersetting

2020-03-18 Thread 965147...@qq.com
Hello all, When using DeltaStreamer to consume Kafka data, I want to specify group.id, but the problem I encountered is that after specifying it, I cannot find it on the Kafka side. I found that there are no groups under my topic. Why is this? I also manually set enable.auto.commit = true at