Re: [VOTE] Release 1.0.0-beta2, release candidate #2

2024-07-12 Thread Balaji Varadarajan
+1 (binding) - Write and read COW table through spark Balaji.V On Friday, July 12, 2024 at 06:07:55 AM PDT, Lokesh Jain wrote: +1 (non-binding) - Verified checksums and signatures - Ran quickstart Regards Lokesh On 2024/07/08 17:57:13 sagar sumit wrote: > Hi everyone, > > Please

Re: [VOTE] Release 0.15.0, release candidate #3

2024-06-03 Thread Balaji Varadarajan
+1 (binding) Balaji.V On Sunday, June 2, 2024 at 06:48:01 PM PDT, Sivabalan wrote: +1 Ran deltastreamer tests, meta sync tests and Quick start. All good on my end. On Thu, 30 May 2024 at 16:07, Yexiang Chang wrote: > > +1, I verified 0.15.0 hudi-spark-bundle and hudi-hadoop-mr-bundle

Re: [VOTE] Release 0.14.0, release candidate #3

2023-09-22 Thread Balaji Varadarajan
+1 (binding) Ran validate stage test. Checking Checksum of Source Release Checksum Check of Source Release - [OK] Checking Signature Signature Check - [OK] Checking for binary files in the source files No Binary Files in the source files? - [OK] Checking for DISCLAIMER

Re: [VOTE] Release 0.12.2, release candidate #1

2022-12-24 Thread Balaji Varadarajan
+1 (binding) Ran release validation script. (⎈|dev-core-0:N/A)balaji-varadarajan--NR26725P2G:scripts balaji.varadarajan$ ./release/validate_staged_release.sh --release=0.12.2 --rc_num=1 /tmp/validation_scratch_dir_001 ~/code/oss/hudi/scripts Downloading from svn co https://dist.apache.org

Re: [VOTE] Release 0.12.0, release candidate #2

2022-08-15 Thread Balaji Varadarajan
+1 (binding) On Monday, August 15, 2022 at 08:42:08 AM PDT, Rahil C wrote: +1 -Rahil C On Mon, Aug 15, 2022 at 8:07 AM Nishith wrote: > +1 (binding) > > -Nishith > > > On Aug 15, 2022, at 12:20 AM, Shiyan Xu > wrote: > > > > +1 (binding) > > > > Manually ran deltastreamer job with

Re: Next stop : Minor Or Major release?

2022-02-18 Thread Balaji Varadarajan
+1 on option B. Balaji.V On Thu, Feb 17, 2022 at 11:20 PM Nishith wrote: > +1 to B for the same reasons > > -Nishith > > > On Feb 17, 2022, at 9:22 PM, Vinoth Chandar wrote: > > > > +1 on B as well. same rationale as Raymond's. I think we have all major > > chunks landed or PRs up. > > Love

Re: [VOTE] Release 0.10.1, release candidate #2

2022-01-24 Thread Balaji Varadarajan
+1 binding. RC passed. Balaji.V On Monday, January 24, 2022, 10:28:58 AM PST, Bhavani Sudha wrote: +1 binding Ran RC check, quickstart and some IDE tests. Thanks, Sudha On Mon, Jan 24, 2022 at 9:23 AM sagar sumit wrote: > +1 > > - Builds for Spark2/3 [OK] > - Spark quickstart

Re: [VOTE] Release 0.10.0, release candidate #3

2021-12-06 Thread Balaji Varadarajan
+1 (binding) - Package Build successful - Overnight staging test - Data Validation successful for COW upsert workload. On Monday, December 6, 2021, 06:40:32 AM PST, vino yang wrote: +1 (binding) - build successfully - ran spark quickstart - verified checksum Best, Vino Y Ethan

Re: [VOTE] Release 0.9.0, release candidate #2

2021-08-23 Thread Balaji Varadarajan
+1 (binding) $ ./release/validate_staged_release.sh --release=${RC_VERSION} --rc_num=2 ... Downloading from svn co https://dist.apache.org/repos/dist//dev/hudi Validating hudi-0.9.0-rc2 with release type "dev" Checking Checksum of Source Release Checksum Check of Source Release - [OK] % Total

Re: [DISCUSS] Enable Github Discussions

2021-08-11 Thread Balaji Varadarajan
+1 Balaji.V On Wed, Aug 11, 2021 at 7:12 PM Bhavani Sudha wrote: > +1 > > Thanks, > Sudha > > On Wed, Aug 11, 2021 at 7:08 PM vino yang wrote: > > > +1 > > > > Best, > > Vino > > > > Pratyaksh Sharma wrote on Thu, Aug 12, 2021 at 2:16 AM: > > > > > +1 > > > > > > I have never used it, but we can try this

Re: please give me the contributor permission

2021-01-27 Thread Balaji Varadarajan
Welcome to Apache Hudi Community !!  I have given contributor permissions. Looking forward to your contributions !! Balaji.V On Monday, January 25, 2021, 06:23:57 PM PST, jiangjiguang719 wrote: Hi, I want to contribute to Apache Hudi. Would you please give me the contributor

Re: [VOTE] Release 0.7.0, release candidate #1

2021-01-21 Thread Balaji Varadarajan
+1 (binding) 1. Ran release validation script successfully. 2. Build successful. 3. Quickstart succeeded. Checking Checksum of Source Release Checksum Check of Source Release - [OK] % Total % Received % Xferd Average Speed Time Time Time Current

Re: Congrats to our newest committers!

2020-12-03 Thread Balaji Varadarajan
Very Well deserved !! Many congratulations to Satish and Prashant. Balaji.V On Thursday, December 3, 2020, 11:07:09 AM PST, Bhavani Sudha wrote: Congratulations Satish and Prashant! On Thu, Dec 3, 2020 at 11:03 AM Pratyaksh Sharma wrote: Congratulations Satish and Prashant! On Fri,

Re: [DISCUSS] 0.7.0 release timelines

2020-12-02 Thread Balaji Varadarajan
+1 for (2) On Wednesday, December 2, 2020, 08:09:29 AM PST, vino yang wrote: +1 for option 2 Gary Li wrote on Wed, Dec 2, 2020 at 4:01 PM: > vote for option 2. > > From: nishith agarwal > Sent: Wednesday, December 2, 2020 3:16 PM > To: dev@hudi.apache.org >

Re: why not use spark datasource in DeltaStreamer

2020-12-01 Thread Balaji Varadarajan
Regarding RDD vs DataFrame, the historical reason is that RDD provided more control, with the low-level API needed for Hudi to manage various aspects of writing. On a related note, if you look at the current approach with Flink support, the input batch is getting parameterized to support

Re: Reg weekly sync meeting

2020-11-02 Thread Balaji Varadarajan
+1 On Sunday, November 1, 2020, 09:13:44 PM PST, Gary Li wrote: +1 for biweekly meeting. Gary Li From: Vinoth Chandar Sent: Friday, October 30, 2020 2:01:22 PM To: dev@hudi.apache.org ; us...@hudi.apache.org Subject: Re: Reg weekly sync meeting + users list as well. On Thu, Oct 29,

Re: Hudi-1365

2020-11-02 Thread Balaji Varadarajan
Hi Selvaraj, I have replied in the Jira. Thanks, Balaji.V On Sunday, November 1, 2020, 01:17:05 AM PST, selvaraj periyasamy wrote: Team, Could you look into Hudi-1365? Performance is really heavily impacted for some reason. Thanks, Selva

Re: I want to contribute to Apache Hudi

2020-10-29 Thread Balaji Varadarajan
Welcome to Apache Hudi community. I have added you as a contributor in Jira. Balaji.V On Wednesday, October 28, 2020, 08:11:00 PM PDT, jack_zhangsj wrote: Hi, I want to contribute to Apache Hudi. Would you please give me the contributor permission? My JIRA ID is  jack_zhangsj .

Re: [EXT] Re: Bucketing in Hudi

2020-10-26 Thread Balaji Varadarajan
tools as well and our choice would be based on ease of use and amount of changes.   When would be a good time to chat today or tomorrow?   Thanks, Roopa   From: Balaji Varadarajan Date: Thursday, October 22, 2020 at 9:24 PM To: "dev@hudi.apache.org" Cc: DL-AIE Subject

Re: [EXT] Re: Bucketing in Hudi

2020-10-26 Thread Balaji Varadarajan
or tomorrow?   Thanks, Roopa   From: Balaji Varadarajan Date: Thursday, October 22, 2020 at 9:24 PM To: "dev@hudi.apache.org" Cc: DL-AIE Subject: Re: [EXT] Re: Bucketing in Hudi   Hi Roopa,   Bucketing is a more general concept. I think what you are referring to is how to

Re: [EXT] Re: Bucketing in Hudi

2020-10-22 Thread Balaji Varadarajan
a spark bucketed table having metadata different from Hive bucketed tables as Spark cannot understand Hive’s hashing algorithm. Is this something that Hudi might support? Thanks, Roopa From: Balaji Varadarajan Date: Wednesday, October 21, 2020 at 9:01 PM To: "dev@hudi.apache.org"

Re: Bucketing in Hudi

2020-10-21 Thread Balaji Varadarajan
Hudi supports pluggable indexing (HoodieIndex) and the phases of index lookup are nicely abstracted out. We have a Jira for supporting Bucket Indexing: https://issues.apache.org/jira/browse/HUDI-55 You can get bucket indexing done by implementing that interface along with additional changes
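A minimal sketch of the bucketing idea behind such an index, assuming a stable hash of the record key decides the bucket. This is not the HoodieIndex interface; `bucket_for_key` and `tag_locations` are hypothetical names, and CRC32 is only an illustrative hash choice.

```python
import zlib


def bucket_for_key(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket id with a stable hash (CRC32, illustrative only)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def tag_locations(record_keys, num_buckets):
    """Group incoming keys by bucket id, loosely mimicking an index lookup phase."""
    buckets = {}
    for key in record_keys:
        buckets.setdefault(bucket_for_key(key, num_buckets), []).append(key)
    return buckets


if __name__ == "__main__":
    # Hypothetical record keys; the same key always lands in the same bucket.
    print(tag_locations(["id-1", "id-2", "id-3", "id-1"], num_buckets=4))
```

Because the bucket is a pure function of the key, an update to an existing key can be routed without scanning other buckets, which is the property a bucket index relies on.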

Re: Deleting Hudi Partitons

2020-10-21 Thread Balaji Varadarajan
Fixing Satish's incorrect email. On Wednesday, October 21, 2020, 06:19:43 PM PDT, Balaji Varadarajan wrote: cc Satish who implemented Insert Overwrite support. We have recently landed Insert Overwrite support in Hudi. Partition level deletion is a logical extension of this feature

Re: Deleting Hudi Partitons

2020-10-21 Thread Balaji Varadarajan
cc Satish who implemented Insert Overwrite support. We have recently landed Insert Overwrite support in Hudi. Partition level deletion is a logical extension of this feature but is not available yet. I have added a Jira to track this: https://issues.apache.org/jira/browse/HUDI-1350

Re: Hudi - Concurrent Writes

2020-10-19 Thread Balaji Varadarajan
We are planning to add parallel writing to Hudi (at different partition levels) in the next release. Balaji.V On Friday, October 16, 2020, 11:54:51 PM PDT, tanu dua wrote: Hi, Do we have support for concurrent writes in 0.6, as I got a similar requirement to ingest in parallel from

Re: Hudi Query Latest Records

2020-10-09 Thread Balaji Varadarajan
                      NULL    Bucket Columns:        []                      NULL    Sort Columns:          []                      NULL    Storage Desc Params:    NULL    NULL        serialization.format    1 On Fri, 9 Oct 2020 at 19:07, Balaji Varadarajan wrote: >  Can you paste the detailed h

Re: Hudi Query Latest Records

2020-10-09 Thread Balaji Varadarajan
in hive / hue? Regards, Ranganath On Thu, 1 Oct 2020 at 09:45, Balaji Varadarajan wrote: >  Assuming commit1 happened before commit2, this is what you should expect > when running a standard query through query engines. > Balaji.V > >    On Tuesday, September 29, 2020, 03:04:17 PM

Re: Hudi Query Latest Records

2020-09-30 Thread Balaji Varadarajan
Assuming commit1 happened before commit2, this is what you should expect when running a standard query through query engines. Balaji.V On Tuesday, September 29, 2020, 03:04:17 PM PDT, Ranganath Tirumala wrote: Hi, Is there a way we can query to get the latest record across commits?
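A minimal sketch of the expected behaviour described above, assuming each record carries a record key and a commit time that sorts chronologically. The field names `record_key` and `commit_time` are hypothetical, and this plain-Python reduction only illustrates the "latest record wins" result, not how a query engine computes it.

```python
def latest_per_key(records):
    """Keep one record per key: the one written by the most recent commit."""
    latest = {}
    for rec in records:
        key = rec["record_key"]
        if key not in latest or rec["commit_time"] > latest[key]["commit_time"]:
            latest[key] = rec
    return latest


if __name__ == "__main__":
    rows = [
        {"record_key": "k1", "commit_time": "20200929150000", "value": "from commit1"},
        {"record_key": "k1", "commit_time": "20200929160000", "value": "from commit2"},
        {"record_key": "k2", "commit_time": "20200929150000", "value": "from commit1"},
    ]
    for key, rec in latest_per_key(rows).items():
        print(key, rec["value"])
```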

Re: Apache Hudi Data Reconciliation

2020-09-12 Thread Balaji Varadarajan
Hi Jialun, There is no outside documentation for this case except Javadocs (https://issues.apache.org/jira/browse/HUDI-1277). The payload interfaces are themselves first-class citizens of Hudi (

Re: [Question] HoodieROTablePathFilter not accept dir path

2020-09-11 Thread Balaji Varadarajan
ath ending with `/` (a directory path). To me, this seems to be a corner case not being covered. Could you kindly confirm the expectation please? Thanks. On Tue, Sep 8, 2020 at 8:58 PM Balaji Varadarajan wrote: > Hi Raymond, > IIRC, we need to give a glob path to make HoodieROTablePathFilter work

Re: Request to Add in Contributor list

2020-09-09 Thread Balaji Varadarajan
Added. Welcome to Hudi community.  Balaji.V On Tuesday, September 8, 2020, 09:31:37 PM PDT, Mani Jindal wrote: Hi team Please guide me how can i request for the contributor access for jira so that i can assign some jira tickets to myself and contribute to the hudi community. JIRA

Re: [Question] Redundant release tag?

2020-09-08 Thread Balaji Varadarajan
Deleted. Thanks, Balaji.V On Tuesday, September 8, 2020, 08:51:36 PM PDT, Raymond Xu wrote: I think there is a mistakenly created version tag 0.60 in JIRA; the number does not seem to follow the release format. Anyone care to delete this?

Re: [Question] HoodieROTablePathFilter not accept dir path

2020-09-08 Thread Balaji Varadarajan
Hi Raymond, IIRC, we need to give a glob path to make HoodieROTablePathFilter work correctly (e.g: "base/partition/*"). The path-cache is at partition level and not at table level, so we need to extract the partition-path correctly to be used as the look-up key. To extract partition-path, the

Re: [DISCUSS] New Community Weekly Sync up Time

2020-09-08 Thread Balaji Varadarajan
+1 On Tuesday, September 8, 2020, 05:54:52 PM PDT, Mehrotra, Udit wrote: I am okay with this too. On 9/8/20, 5:33 PM, "Raymond Xu" wrote:     CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender

Re: schema compatibility check and change column type

2020-09-07 Thread Balaji Varadarajan
Hi Ji, Moving this discussion to https://github.com/apache/hudi/issues/2063 which you have opened. I have added a possible workaround in the comments. Please try it out and respond in the issue. Thanks, Balaji.V On Monday, September 7, 2020, 10:11:13 AM PDT, Jl Liu (cadl) wrote:

Re: Congrats to our newest committers!

2020-09-03 Thread Balaji Varadarajan
Udit, Gary, Raymond and Pratyaksh, Many congratulations :) Well deserved. Looking forward to your continued contributions. Balaji.V On Thursday, September 3, 2020, 07:19:45 PM PDT, Sivabalan wrote: Congrats to all 3. Much deserved and really excited to see more committers  On Thu,

Re: Coding guidelines

2020-09-02 Thread Balaji Varadarajan
+1. All current and future contributors/committers need to read this. Balaji.V On Wednesday, September 2, 2020, 01:11:46 AM PDT, vino yang wrote: +1 to have the coding guidelines. Left some comments. Best, Vino Vinoth Chandar wrote on Wed, Sep 2, 2020 at 9:51 AM: > Hello all, > > Put together a

Re: [DISCUSS] Formalizing the release process

2020-09-01 Thread Balaji Varadarajan
+1 on the process. Balaji.V On Tuesday, September 1, 2020, 04:56:55 PM PDT, Gary Li wrote: +1 Gary Li From: Bhavani Sudha Sent: Wednesday, September 2, 2020 3:11:06 AM To: us...@hudi.apache.org Cc: dev@hudi.apache.org Subject: Re: [DISCUSS] Formalizing the release process +1 on the

Re: HUDI-1232

2020-09-01 Thread Balaji Varadarajan
-executors 200 --executor-cores 1 --conf spark.executor.memoryOverhead=4096 --conf spark.shuffle.service.enabled=true --class com.test.cdp.reporting.trr.TRREngine /home/seperiya/transformation-engine.jar Thanks, Selva On Sat, Aug 29, 2020 at 12:55 PM Balaji Varadarajan wrote: > Hi Selvaraj, >

Re: DevX, Test infra Rgdn

2020-08-31 Thread Balaji Varadarajan
+1. This would be a great contribution as all developers will benefit from this work.  On Monday, August 31, 2020, 08:07:08 AM PDT, Vinoth Chandar wrote: +1 this is a great way to also ramp on the code base On Sun, Aug 30, 2020 at 8:00 AM Sivabalan wrote: > As Hudi matures as a

Re: Hudi Writer vs Spark Parquet Writer - Sync

2020-08-31 Thread Balaji Varadarajan
Hi Felix,  For read side performance, we are focussed on adding clustering support (https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance) and consolidated metadata

Re: HUDI-1232

2020-08-29 Thread Balaji Varadarajan
dating       spark-streaming-kafka artifact from 0.8_2.11/2.12 to 0.10_2.11/2.12.   - *IMPORTANT* This version requires your runtime spark version to be   upgraded to 2.4+. Thanks, Selva On Sat, Aug 29, 2020 at 1:16 AM Balaji Varadarajan wrote: >  From the hudiLogs.txt, I find only HoodieROTable

Re: HUDI-1232

2020-08-29 Thread Balaji Varadarajan
From the hudiLogs.txt, I find only HoodieROTablePathFilter related logs repeating, which suggests this is the read side. So, we recommend using the latest version. I tried 2.3.3 and ran quickstart without issues. Give it a shot and let us know if there are any issues. Balaji.V On Friday,

Re: Null-value for required field Error

2020-08-23 Thread Balaji Varadarajan
M selvaraj periyasamy < selvaraj.periyasamy1...@gmail.com> wrote: > Thanks Balaji. > > could you please provide more info on how to get it done and pass it to > hudi? > > Thanks, > Selva > > On Fri, Aug 21, 2020 at 12:33 PM Balaji Varadarajan > wrote: > > >

Re: [VOTE] Release 0.6.0, release candidate #1

2020-08-22 Thread Balaji Varadarajan
+1 (binding) 1. Ran long running structured streaming writes on fake data and verified that compactions and ingestion are happening without errors. 2. Ran both Scala and Python based quickstart without any errors. There was an issue in the documented quickstart steps (not in hudi) for python example.

Re: Incremental query on partition column

2020-08-21 Thread Balaji Varadarajan
Thanks for the detailed email David. We had discussed this in last week's community meeting and Vinoth had ideas on how to implement this. This is something that can be supported by the timeline layout that Hudi has. It would be a new feature (new write operation) that basically appends the

Re: Null-value for required field Error

2020-08-21 Thread Balaji Varadarajan
Hi Selvaraj, Even though the incoming batch has non-null values for the new column, existing data does not have this column. So, you need to make sure the new column is nullable in the Avro schema so that the schema stays backwards compatible. Balaji.V On Friday, August 21, 2020, 10:06:40 AM PDT, selvaraj
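A minimal sketch of what a nullable, backwards-compatible column looks like in an Avro schema, assuming the usual Avro convention of a union with "null" listed first and a null default. The record name `example_record`, the field names, and the helper `add_nullable_field` are hypothetical.

```python
import json


def add_nullable_field(schema: dict, name: str, avro_type: str) -> dict:
    """Return a copy of an Avro record schema with one extra nullable field.

    The new field is a union with "null" first and has a null default, so
    records written before the field existed can still be read with it.
    """
    evolved = dict(schema)
    evolved["fields"] = list(schema["fields"]) + [
        {"name": name, "type": ["null", avro_type], "default": None}
    ]
    return evolved


# Hypothetical schema for the existing data.
base_schema = {
    "type": "record",
    "name": "example_record",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "ts", "type": "long"},
    ],
}

evolved_schema = add_nullable_field(base_schema, "new_col", "string")

if __name__ == "__main__":
    print(json.dumps(evolved_schema, indent=2))
```

Listing "null" first matters because an Avro default value has to match the first branch of the union.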

Re: I want to contribute to Apache Hudi.

2020-08-20 Thread Balaji Varadarajan
Welcome Trevor to Hudi community. It looks like you have been added to the contributor role. Balaji.V On Thursday, August 20, 2020, 11:07:47 AM PDT, wowtua...@gmail.com wrote: I want to contribute to Apache Hudi. Would you please give me the permission as a contributor ? My JIRA

Re: [DISCUSS] Support Spark Structured Streaming read from Hudi table

2020-08-20 Thread Balaji Varadarajan
Hi linshan, Sorry for the delay in responding. It is better to discuss code changes over draft PR. Can you open one and tag us there. At a high level, it looks like you are using Spark Datasource v2 APIs while currently the structured streaming write is implemented using V1 API. Let's discuss

Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

2020-08-20 Thread Balaji Varadarajan
+1. This should be good to have as an option. If everybody agrees, please go ahead with RFC and we can discuss details there. Balaji.V On Tuesday, August 18, 2020, 04:37:18 PM PDT, Abhishek Modi wrote: Hi everyone! I was hoping to discuss adding support for making `_hoodie_record_key`

Re: Kafka Hudi pipeline design

2020-07-21 Thread Balaji Varadarajan
Please see answers inline... On Sunday, July 19, 2020, 10:08:09 PM PDT, Lian Jiang wrote: Hi, I have a kafka topic using a kafka s3 connector to dump data into s3 hourly in parquet format. These parquet files are partitioned in ingestion time and each record has fields which are

Re: Date handling in HUDI

2020-07-21 Thread Balaji Varadarajan
Gary/Udit, As you are familiar with this part, can you please answer this question? Thanks, Balaji.V On Monday, July 20, 2020, 08:18:16 AM PDT, tanu dua wrote: Hi Guys, May I know how you handle date and timestamp in Hudi? When I set DataTypes as Date in StructType it’s

Re: Hard Delete

2020-07-17 Thread Balaji Varadarajan
Hi Sivaprakash, You can configure cleaner to clean the older file versions which contain those records to be deleted. You can take a look at  https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-WhatdoestheHudicleanerdo  for more details. Balaji.V On Friday, July 17, 2020, 07:47:55 AM

Re: Handling delta

2020-07-16 Thread Balaji Varadarajan
Hi Sivaprakash, Uniqueness of records is determined by the record key you specify to hudi. Hudi supports filtering out existing records (by record key). By default, it would upsert all incoming records.  Please look at 
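A minimal sketch of the two behaviours mentioned above, assuming a table modelled as a dict from record key to record. `upsert` and `filter_existing` are hypothetical helpers for illustration, not Hudi APIs.

```python
def upsert(table: dict, incoming: list, key_field: str) -> dict:
    """Default behaviour: every incoming record inserts or overwrites by record key."""
    result = dict(table)
    for rec in incoming:
        result[rec[key_field]] = rec
    return result


def filter_existing(table: dict, incoming: list, key_field: str) -> list:
    """Alternative: drop incoming records whose record key is already in the table."""
    return [rec for rec in incoming if rec[key_field] not in table]


if __name__ == "__main__":
    table = {"a": {"id": "a", "v": 1}}
    batch = [{"id": "a", "v": 2}, {"id": "b", "v": 3}]
    print(upsert(table, batch, "id"))           # "a" is overwritten, "b" is added
    print(filter_existing(table, batch, "id"))  # only "b" survives the filter
```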

Re: [DISCUSS] Make delete marker configurable?

2020-06-29 Thread Balaji Varadarajan
+1  Sent from Yahoo Mail for iPhone On Monday, June 29, 2020, 5:34 PM, Vinoth Chandar wrote: +1 as well. (sorry , for jumping in late) On Sun, Jun 28, 2020 at 11:36 AM Shiyan Xu wrote: > Thanks for the +1. Filed https://issues.apache.org/jira/browse/HUDI-1058 > > On Sat, Jun 27, 2020 at

Re: How to extend the timeline server schema to accommodate business metadata

2020-05-31 Thread Balaji Varadarajan
Hi Mario, Timeline Server was designed to serve Hudi metadata for Hudi writers and readers. It may not be suitable to serve arbitrary data. But, it is an interesting thought. Can you elaborate more on what kind of business metadata you are looking for? Is this something you are planning to store

Re: hudi dependency conflicts for test

2020-05-20 Thread Balaji Varadarajan
Thanks for using Hudi. Looking at pom definitions between 0.5.1 and 0.5.2, I don't see any difference that could cause this issue. As it works with 0.5.2, I am assuming you are not blocked. Let us know otherwise. Balaji.V On Wednesday, May 20, 2020, 01:17:08 PM PDT, Lian Jiang wrote:

Re: Apache Hudi Graduation vote on general@incubator

2020-05-19 Thread Balaji Varadarajan
Terrific job :) We are marching on !! Balaji.V On Tuesday, May 19, 2020, 05:16:57 PM PDT, Sivabalan wrote: wow ! 19 binding votes. Great :) On Tue, May 19, 2020 at 1:55 AM lamber-ken wrote: > > > > Gread job! and good luck for apache hudi project. > > > > > Best, > Lamber-Ken > >

Re: [VOTE] Apache Hudi graduation to top level project

2020-05-06 Thread Balaji Varadarajan
> > created, the person holding such office to serve at the direction of the > > Board of Directors as the chair of the Apache Hudi Project, and to have > > primary responsibility for management of the projects within the scope of > > responsib

Re: [DISCUSS] Next Release timeline

2020-04-26 Thread Balaji Varadarajan
+1 on Sudha being RM and targeting next release for mid may. Balaji.V On 2020/04/23 14:27:46, Vinoth Chandar wrote: > Thanks all. Encourage everyone to chime in more, so we can make a decision > here! > > On Thu, Apr 23, 2020 at 6:29 AM Sivabalan wrote: > > > sounds good. We could go with a

Re: [DISCUSS] Bug bash?

2020-04-22 Thread Balaji Varadarajan
+1. Would also be great if folks sign-up for testing/trying out the master branch in their real environments  On Wednesday, April 22, 2020, 02:48:13 PM PDT, Bhavani Sudha wrote: +1 Sounds like a good idea On Wed, Apr 22, 2020 at 1:51 PM Vinoth Chandar wrote: > Just floating a very

Re: [DISCUSS] Support popular metrics reporter

2020-04-22 Thread Balaji Varadarajan
+1 On Wednesday, April 22, 2020, 08:35:30 AM PDT, leesf wrote: +1 Vinoth Chandar wrote on Wed, Apr 22, 2020 at 2:24 PM: > +1 from me as well > > On Mon, Apr 20, 2020 at 9:37 PM vino yang wrote: > > > Hi Raymond, > > > > Thanks for opening this discussion. > > > > IMHO, as Hudi's user base grows,

Re: [DISCUSS] Insert Overwrite with snapshot isolation

2020-04-16 Thread Balaji Varadarajan
e. Let me know > your thoughts. It would be good to nail other details like whether/how to > deal with external index management with this API. > Thanks,Balaji.V >    On Thursday, April 16, 2020, 10:46:19 AM PDT, Balaji Varadarajan > wrote: > > > +1 from me. This is a really cool fe

Re: [DISCUSS] Insert Overwrite with snapshot isolation

2020-04-16 Thread Balaji Varadarajan
+1 from me. This is a really cool feature.  Yes, A new file slice (empty parquet) is indeed generated for every file group in a partition.  Regarding cleaning these "empty" file slices eventually by cleaner (to avoid cases where there are too many of them lying around) in a safe way, we can

Re: New PPMC Member : Bhavani Sudha

2020-04-07 Thread Balaji Varadarajan
Congratulations Sudha :) Well deserved.  Welcome to PPMC.  Balaji.V On Tuesday, April 7, 2020, 03:04:37 PM PDT, Gary Li wrote: Congrats Sudha! Appreciated all the work you have done! On Tue, Apr 7, 2020 at 2:57 PM Y Ethan Guo wrote: > Congrats!!! > > On Tue, Apr 7, 2020 at 2:55 PM

Re: New Committer: lamber-ken

2020-04-07 Thread Balaji Varadarajan
Many Congratulations Lamber-Ken.  Well deserved !! Balaji.V On Tuesday, April 7, 2020, 02:23:51 PM PDT, Y Ethan Guo wrote: Congrats!!! On Tue, Apr 7, 2020 at 2:22 PM Gary Li wrote: > Congrats lamber! Well deserved! > > On Tue, Apr 7, 2020 at 2:18 PM Vinoth Chandar wrote: > > >

Re: [DISSCUSS] Troubleshooting flow

2020-04-06 Thread Balaji Varadarajan
Agree. The triaging process makes sense to me. Balaji.V On Monday, April 6, 2020, 09:54:24 AM PDT, Vinoth Chandar wrote: Hi, I feel there are couple of action items here.. a) JIRA to track work for slack-ML integration b) Document the support triaging process : Slack (level 1) ->

Re: Query regarding restoring HUDI tables to older commits

2020-03-22 Thread Balaji Varadarajan
> > proceeding". So probably the embedded timeline server can recreate the > view > > next time it comes back up? > > > > Thanks > > Prashant > > > > > > On Wed, Mar 18, 2020 at 11:37 AM Balaji Varadarajan > > wrote: >

Re: Could not load key generator class org.apache.hudi.ComplexKeyGenerator

2020-03-21 Thread Balaji Varadarajan
With 0.5.1, the key-generator classes are relocated to org.apache.hudi.keygen. You can find the information in the release notes at https://hudi.incubator.apache.org/releases.html#release-051-incubating-docs Balaji.V On Saturday, March 21, 2020, 01:47:48 PM PDT, FO O wrote: Hi, When
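A minimal sketch of updating writer options after the relocation, assuming the key generator is set through the `hoodie.datasource.write.keygenerator.class` option and that only classes sitting directly in `org.apache.hudi` moved. `relocate_keygen` is a hypothetical helper; check the release notes linked above for the authoritative class names.

```python
OLD_PACKAGE = "org.apache.hudi"
NEW_PACKAGE = "org.apache.hudi.keygen"
# Assumed option name for the key generator class in the Spark datasource writer.
KEYGEN_OPTION = "hoodie.datasource.write.keygenerator.class"


def relocate_keygen(options: dict) -> dict:
    """Rewrite a pre-0.5.1 key generator class name to the org.apache.hudi.keygen package."""
    updated = dict(options)
    cls = updated.get(KEYGEN_OPTION)
    if cls:
        package, _, simple_name = cls.rpartition(".")
        # Only touch classes that lived directly in the old package.
        if package == OLD_PACKAGE:
            updated[KEYGEN_OPTION] = f"{NEW_PACKAGE}.{simple_name}"
    return updated


if __name__ == "__main__":
    old = {KEYGEN_OPTION: "org.apache.hudi.ComplexKeyGenerator"}
    print(relocate_keygen(old)[KEYGEN_OPTION])
```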

Re: Query regarding restoring HUDI tables to older commits

2020-03-18 Thread Balaji Varadarajan
Prashanth, I think we should not be reverting clean operations here. Cleans are done on the oldest file slices and a restore/rollback is not completely undoing the work of clean that happened before it.  For incremental timeline syncing, embedded timeline server needs to read these clean

Re: [DISCUSS] Restructure hudi-utilities module

2020-03-09 Thread Balaji Varadarajan
+1 on Vinoth's suggestion on waiting for the lower level (write-client) to be refactored and reorganized first. We can then look at Data-Source and DeltaStreamer to see how best to organize them. Balaji.V On Sunday, March 8, 2020, 11:06:13 PM PDT, Vinoth Chandar wrote: >> make

Re: [ANNOUNCE] Code is frozen for next release(0.5.2)

2020-02-29 Thread Balaji Varadarajan
+1 on cutting the branch.  Vino, let us know in this thread if you run into any problems in the release process. Balaji. V Sent from Yahoo Mail for iPhone On Saturday, February 29, 2020, 9:19 AM, Vinoth Chandar wrote: Great!  Can we cut the release candidate branch 0.5.2 right away so that

Re: Need clarity on these test cases in TestHoodieDeltaStreamer

2020-02-27 Thread Balaji Varadarajan
Awesome Pratyaksh, would you mind opening a PR to document it? Balaji.V Sent from Yahoo Mail for iPhone On Wednesday, February 26, 2020, 11:14 PM, Pratyaksh Sharma wrote: Hi, I figured out the issue yesterday. Thank you for helping me out. On Thu, Feb 27, 2020 at 12:01 AM

Re: [DISCUSS] RFC - 08 : Record level indexing mechanisms for Hudi datasets

2020-02-25 Thread Balaji Varadarajan
+1. Let's do it :) Balaji.V On Mon, Feb 24, 2020 at 6:36 PM Shiyan Xu wrote: > +1 great reading and values! > > On Mon, 24 Feb 2020, 15:31 nishith agarwal, wrote: > > > +100 > > - Reduces index lookup time hence improves job runtime > > - Paves the way for streaming style ingestion > > -

Re: [DISCUSS] Support for complex record keys with TimestampBasedKeyGenerator

2020-02-25 Thread Balaji Varadarajan
See if you can have a generic implementation where individual fields in the partition-path can be configured with their own key-generator class. Currently, TimestampBasedKeyGenerator is the only type specific custom generator. If we are anticipating more such classes for specialized types,
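A minimal sketch of the generic approach suggested above, assuming each partition-path field is paired with its own formatter. The functions `identity` and `timestamp_to_date` stand in for per-field key-generator classes and are hypothetical, as are the field names.

```python
from datetime import datetime, timezone


def identity(value) -> str:
    """Use the field value as-is for its part of the partition path."""
    return str(value)


def timestamp_to_date(value) -> str:
    """Format an epoch-seconds value as yyyy/MM/dd in UTC (timestamp-style field)."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc).strftime("%Y/%m/%d")


def partition_path(record: dict, field_generators) -> str:
    """Build the partition path by applying each field's own generator in order."""
    return "/".join(gen(record[field]) for field, gen in field_generators)


if __name__ == "__main__":
    rec = {"country": "us", "ts": 86400}
    print(partition_path(rec, [("country", identity), ("ts", timestamp_to_date)]))
```

Pairing fields with generators keeps type-specific handling out of the composite generator, so a new specialized type only needs a new per-field function.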

Re: updatePartitionsToTable() is time consuming and redundant.

2020-02-16 Thread Balaji Varadarajan
> So if I'm not wrong, the code will be marking all partitions which got > > UPDATE data for partition update. Hence time consuming. > > > > Regards, > > Purushotham Pushpavanth > > > > > > > > On Mon, 20 Jan 2020 at 08:58, Balaji Varadaraja

Re: [DISCUSS] Redraw of hudi data lake architecture diagram on langing page

2020-01-23 Thread Balaji Varadarajan
+1 as well. Looks great. Balaji.V On Thursday, January 23, 2020, 08:17:47 AM PST, Vinoth Chandar wrote: Looks good . +1 ! On Wed, Jan 22, 2020 at 11:44 PM lamberken wrote: > > > Hello everyone, > > > I redrawed the hudi data lake architecture diagram on landing page. If you > have

Re: [VOTE] Release 0.5.1-incubating, release candidate #1

2020-01-22 Thread Balaji Varadarajan
+1 (binding) Ran the following validation steps: 1. Checked out RC candidate source code and compiled successfully 2. Ran Apache Hudi quickstart steps successfully on 0.5.1-rc1 3. Ran Long running deltastreamer test for a half day without any exceptions. 4. Compliance : Ran

Re: Would not Stage source releases on dist.apache.org

2020-01-20 Thread Balaji Varadarajan
-depth=immediates* leesf wrote on Tue, Jan 21, 2020 at 3:07 PM: > Hi balaji, > > I would not find entrypoint to create a folder under dev/incubator/hudi, > have no permissions? Please advise. Thanks. > > Balaji Varadarajan wrote on Tue, Jan 21, 2020 at 2:14 PM: > >> >> Hi Leesf, >> TH

Re: Would not Stage source releases on dist.apache.org

2020-01-20 Thread Balaji Varadarajan
Hi Leesf, The staging directories are intentionally empty. The directories corresponding to 0.5.0-incubating release were deleted from staging directory as the last step of the release. You can create a folder "0.5.1-incubating" under dev/incubator/hudi and add the source release tarballs

Re: updatePartitionsToTable() is time consuming and redundant.

2020-01-19 Thread Balaji Varadarajan
Hi Purushotham, I am unable to reproduce the same partitions getting hive-synced locally. Can you add the following log message in HoodieHiveClient.java, run the code, and send us the logs? diff --git a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java

Re: [DISCUSS] Delay code freeze date for next release until Jan 19th (Sunday)

2020-01-15 Thread Balaji Varadarajan
+1 Sunday should give breathing space to fix the blockers. Balaji.V On Wednesday, January 15, 2020, 06:50:28 AM PST, Vinoth Chandar wrote: +1 from me. I feel sunday is good in general, because the weekend gives enough time for taking care of last minute things On Wed, Jan 15, 2020 at

Re: [DISCUSS] Hudi weekly community update

2020-01-06 Thread Balaji Varadarajan
IIUC, this would look like a digest email summarizing discussion threads, jira and PR activities.  +1 Balaji.V On Sunday, January 5, 2020, 07:49:22 AM PST, leesf wrote: Hi all, As Hudi attracts more attention recently and the community is developing quickly as more and more

Re: Permession for contribute to Apache Hudi

2020-01-02 Thread Balaji Varadarajan
Added your id. Looking forward to your contributions :) Welcome !! Balaji.V On Thursday, January 2, 2020, 05:44:51 PM PST, 谢雄 wrote: Hi, I want to contribute to Apache Hudi. Would you please give me the contributor permission? My JIRA ID is helloteddy.

Re: Contribution guidelines

2019-12-29 Thread Balaji Varadarajan
+1 Thanks for doing this Vinoth. Covers all aspects of contribution in detail. Big +1 to code/RFC review etiquettes. Balaji.V On Sat, Dec 28, 2019 at 7:20 PM vino yang wrote: > Hi Vinoth, > > big +1 from my side. > > Thanks for spending time improving the contribution guidelines. > > It looks

Re: Re:Re: [DISCUSS] RFC-12 : Efficient migration of large parquet tables to Apache Hudi

2019-12-15 Thread Balaji Varadarajan
t the plan into multiple subtasks? Thanks, Nicholas At 2019-12-14 00:18:12, "Vinoth Chandar" wrote: >+1 (per asf policy) > >+100 per my own excitement :) .. Happy to review this! > >On Fri, Dec 13, 2019 at 3:07 AM Balaji Varadarajan >wrote: > >> With Apache Hud

Re: [DISCUSS] Default partition path in TimestampBasedKeyGenerator

2019-12-13 Thread Balaji Varadarajan
Thanks Shahidha for the quick response. Pratyaksh, I am ok with making the behavior consistent with other Key generators. Please go ahead and submit a PR. Thanks, Balaji.V On Thu, Dec 12, 2019 at 10:34 PM Pratyaksh Sharma wrote: > Hi Shahida, > > Thank you for the clarification. Actually I

[DISCUSS] RFC-12 : Efficient migration of large parquet tables to Apache Hudi

2019-12-13 Thread Balaji Varadarajan
With Apache Hudi growing in popularity, one of the fundamental challenges for users has been about efficiently migrating their historical datasets to Apache Hudi. Apache Hudi maintains per record metadata to perform core operations such as upserts and incremental pull. To take advantage of Hudi’s

[DISCUSS] Next Apache Release

2019-12-11 Thread Balaji Varadarajan
Hello all, In the spirit of making Apache Hudi (incubating) releases at a regular cadence, we are starting this thread to kickstart the planning and preparatory work for the next release (0.5.1). As discussed in yesterday's meeting, the current plan is to have a release by end of Jan 2020. As

Today's meeting cancelled

2019-12-03 Thread Balaji Varadarajan
I have cancelled the weekly (9 pm PST) meeting just now. I guess many of us are traveling or on vacation. We will meet next week at the same time. Balaji.V

Re: Issue while querying Hive table after updates

2019-11-20 Thread Balaji Varadarajan
Hi Gurudatt, From the stack-trace, it looks like you are using CombineInputFormat as your default input format for the hive session. If your intention is to use combined input format, can you instead try setting default (set hive.input.format=) to

Re: Small clarification in Hoodie Cleaner flow

2019-11-19 Thread Balaji Varadarajan
I updated the FAQ section to set defaults correctly and add more information related to this : https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-WhatdoestheHudicleanerdo The cleaner retention configuration is based on counts (number of commits to be retained) with the assumption that users

Re: [DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-12 Thread Balaji Varadarajan
+1 on the exporter tool idea. On Mon, Nov 11, 2019 at 10:36 PM vino yang wrote: > Hi Shiyan, > > +1 for this proposal, Also, it looks like an exporter tool. > > @Vinoth Chandar Any thoughts about where to place it? > > Best, > Vino > > Vinoth Chandar 于2019年11月12日周二 上午8:58写道: > > > We can

Re: Migrate Existing DataFrame to Hudi DataSet

2019-11-12 Thread Balaji Varadarajan
Regarding (1), as the exception is happening inside the parquet reader (outside hudi), can you use Spark 2.3 (instead of Spark 2.4, which brings in a particular version of avro/parquet) to create and ingest a brand new dataset and try it out? This would hopefully help isolate the issue. Regarding (2),

Re: [DISCUSS] Simplification of terminologies

2019-11-11 Thread Balaji Varadarajan
Agree with all 3 changes. The naming now looks more consistent than earlier. +1 on them Depending on whether we are renaming Input formats for (1) and (2) - this could require some migration steps for Balaji.V On Mon, Nov 11, 2019 at 7:38 PM vino yang wrote: > Hi Vinoth, > > Thanks for

Re: DISCUSS RFC 7 - Point in time queries on Hudi table (Time-Travel)

2019-11-11 Thread Balaji Varadarajan
+1. This would be a powerful feature which would open up use-cases requiring repeatable query results. Balaji.V On Mon, Nov 11, 2019 at 8:12 AM nishith agarwal wrote: > Folks, > > Starting a discussion thread for enabling time-travel for Hudi datasets. > Please provide feedback on the RFC

Re: [Discuss] Feedback on Hudi improvements

2019-11-08 Thread Balaji Varadarajan
Brandon, Great initiative and thoughts. Thanks for writing a detailed description of what you are looking to achieve. Here are some of my comments/thoughts: 1. HUDI-326 : There is some work that is happening in this direction. But, we should be able to collaborate on this. Siva has opened

New Committer : bhavanisudha

2019-11-07 Thread Balaji Varadarajan
Hello Apache Hudi Community, The Podling Project Management Committee (PPMC) for Apache Hudi (Incubating) has invited Bhavani Sudha Saktheeswaran to become a committer and we are pleased to announce that she has accepted. Bhavani Sudha has made a great impact by fixing critical issues in hudi,

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-06 Thread Balaji Varadarajan
Thanks Sudha. The following times work for me : Mon, Tue, Thursday - 9 p.m to 12 a.m PST Wed - 5:00 to 6:00 am and 9:30 p.m to 12 a.m PST On Wed, Nov 6, 2019 at 12:31 PM Vinoth Chandar wrote: > Interested. > > Mon-Thu 5AM-6:30AM PST > Mon-Thu 9PM-10:30PM PST > > > On Wed, Nov 6, 2019 at

Re: [Discuss] Creation of database in Hive

2019-11-06 Thread Balaji Varadarajan
I have a different opinion on this. Usually, in production deployments (at least whatever I am aware of), database is generally managed at the org/group level. Privacy policies like ACLs are usually done at database level and would need first level management by admins. With such a setup, its
