Re: Need clarity on these test cases in TestHoodieDeltaStreamer

2020-03-02 Thread Sivabalan
I will sync up with Pratyaksh offline on this.

On Thu, Feb 27, 2020 at 11:24 PM Pratyaksh Sharma 
wrote:

> Hi Balaji,
>
> Right now I am facing a different issue in the same test case. The
> number of records is not matching and the assertion is failing. Once I am able
> to fix that as well, I will open the PR for sure. :)
>
> On Thu, Feb 27, 2020 at 11:17 PM Balaji Varadarajan
>  wrote:
>
>>
>> Awesome Pratyaksh, would you mind opening a PR to document it?
>> Balaji.V
>>
>> Sent from Yahoo Mail for iPhone
>>
>>
>> On Wednesday, February 26, 2020, 11:14 PM, Pratyaksh Sharma <
>> pratyaks...@gmail.com> wrote:
>>
>> Hi,
>>
>> I figured out the issue yesterday. Thank you for helping me out.
>>
>> On Thu, Feb 27, 2020 at 12:01 AM vbal...@apache.org 
>> wrote:
>>
>> >
>> > This change was done as part of adding delete API support:
>> >
>> https://github.com/apache/incubator-hudi/commit/7031445eb3cae5a4557786c7eb080944320609aa
>> >
>> > I don't remember the reason behind this.
>> > Sivabalan, can you explain the reason when you get a chance?
>> > Thanks,
>> > Balaji.V
>> > On Wednesday, February 26, 2020, 06:03:53 AM PST, Pratyaksh Sharma <
>> > pratyaks...@gmail.com> wrote:
>> >
>> >  Anybody got a chance to look at this?
>> >
>> > On Mon, Feb 24, 2020 at 1:04 AM Pratyaksh Sharma > >
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > While working on one of my PRs, I am stuck with the following test
>> cases
>> > > in TestHoodieDeltaStreamer -
>> > > 1. testUpsertsCOWContinuousMode
>> > > 2. testUpsertsMORContinuousMode
>> > >
>> > > For both of them, at lines [1] and [2], we are adding 200 to totalRecords
>> > > while asserting the record count and distance count respectively. I am
>> > > unable to understand what these 200 records correspond to. Any leads are
>> > > appreciated.
>> > >
>> > > I feel I am probably missing some piece of code where I need to make
>> > > changes for the above tests to pass.
>> > >
>> > > [1]
>> > >
>> >
>> https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java#L425
>> > > .
>> > > [2]
>> > >
>> >
>> https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java#L426
>> > > .
>> > >
>> > >
>> >
>>
>>
>>
>>

-- 
Regards,
-Sivabalan


Re: Test coverage is now integrated to codecov.io

2020-03-02 Thread Ramachandran Madras Subramaniam
Thank you all for your kind words :)

An update on the issues: I am still seeing some PRs reporting zero coverage
for the forked branch, and hence a drop of 60%+ in coverage. I opened a ticket
with codecov today to understand this issue better.

Also, you might see some PRs not pulling up any coverage for master. This is
because those PRs have not been rebased onto the current master and opened
their diff against an older commit that doesn't have any data in codecov.
This should go away once these PRs are rebased. Rebasing is not mandatory for
now, as this problem will fade away eventually on new PRs.

-Ram

On Sun, Mar 1, 2020 at 8:20 PM Bhavani Sudha 
wrote:

> This is super useful. Thanks Ramachandran!
>
> -Sudha
>
> On Sat, Feb 29, 2020 at 7:42 PM leesf  wrote:
>
> > Great job, thanks for your work.
> >
> > > Sivabalan wrote on Sat, Feb 29, 2020 at 12:02 PM:
> >
> > > Good job! thanks for adding.
> > >
> > > On Fri, Feb 28, 2020 at 5:41 PM vino yang 
> wrote:
> > >
> > > >  Hi Ram,
> > > >
> > > > Thanks for your great work to make the code coverage clear.
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > > Vinoth Chandar wrote on Sat, Feb 29, 2020 at 4:39 AM:
> > > >
> > > > > Thanks Ram! This will definitely help improve the code quality over
> > > time!
> > > > >
> > > > > On Fri, Feb 28, 2020 at 9:45 AM Ramachandran Madras Subramaniam
> > > > >  wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > Diff 1347 <https://github.com/apache/incubator-hudi/pull/1347> was
> > > > > > merged into master yesterday. This enables visibility into code
> > > > > > coverage of Hudi in general and also provides insight into
> > > > > > differential coverage during peer reviews.
> > > > > >
> > > > > > Since this is very recent and still being integrated, you might see
> > > > > > some partial results in your diff. There can be two scenarios here:
> > > > > >
> > > > > > 1. Your diff is not rebased onto the latest master, so the code
> > > > > > coverage report was not generated. To solve this, just rebase onto
> > > > > > the latest master.
> > > > > > 2. Code coverage ran but reported as zero. There was one diff (#1350)
> > > > > > where we saw this issue yesterday. This in general shouldn't happen
> > > > > > and could have been due to an outage of the codecov website. I will
> > > > > > be monitoring upcoming diffs in the near future to see if this
> > > > > > problem persists. Please ping me in the diff if you have any
> > > > > > questions/concerns regarding code coverage.
> > > > > >
> > > > > > Thanks,
> > > > > > Ram
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>


Re: Re: Re: [DISCUSS] Improve the merge performance for cow

2020-03-02 Thread Vinoth Chandar
Hi Lamber-ken,

If you agree that reduceByKey() will shuffle data, then it would serialize and
deserialize anyway, correct?

I am not denying that this may be a valid approach.. But we need much more
rigorous testing, and potentially to implement both approaches side-by-side to
compare.. IMO we cannot conclude based on the one test we had -
where the metadata overhead was so high. The first step would be to introduce
abstractions so that these two ways can be implemented side-by-side and
controlled by a flag..

Also, let's separate the RDD vs DataFrame discussion out of this, since that's
orthogonal anyway..

Thanks
Vinoth


On Fri, Feb 28, 2020 at 11:02 AM lamberken  wrote:

>
>
> Hi vinoth,
>
>
> Thanks for reviewing the initial design :)
> I know there are many problems at present (e.g. shuffling, parallelism
> issues). We can discuss the practicability of the idea first.
>
>
> > ExternalSpillableMap itself was not the issue right, the serialization was
> Right, the new design will not have this issue, because it will not use it
> at all.
>
>
> > This map is also used on the query side
> Right, the proposal aims to improve the merge performance of the cow table.
>
>
> > HoodieWriteClient.java#L546 We cannot collect() the recordRDD at all ...
> > OOM driver
> Here, in order to get the Map, I executed distinct() before collect(), so
> the result is very small.
> Also, it can be implemented in FileSystemViewManager, and lazy loading is
> also ok.
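To illustrate that point, a minimal sketch of the distinct-then-collect step in Spark's Java API (the taggedRecords parameter and the use of getCurrentLocation()/getPartitionPath() are assumptions here, not the actual patch):

    import java.util.Map;
    import org.apache.spark.api.java.JavaRDD;
    import scala.Tuple2;

    // Hedged sketch: collect only the distinct (fileId -> partitionPath) pairs,
    // one entry per affected file group, so the driver never materializes the
    // full record RDD. Assumes records are already tagged with their file location.
    static Map<String, String> affectedFileGroups(JavaRDD<HoodieRecord> taggedRecords) {
      return taggedRecords
          .mapToPair(r -> new Tuple2<>(r.getCurrentLocation().getFileId(), r.getPartitionPath()))
          .distinct()
          .collectAsMap();
    }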
>
>
> > Doesn't this move the problem to tuning spark simply?
> There are two serious performance problems in the old merge logic:
> 1. when upserting many records, it will serialize records to disk, then
> deserialize them when merging with old records;
> 2. only a single thread consumes the old records one by one and then handles
> the merge process, which is much less efficient.
>
>
> > doing a sort based merge repartitionAndSortWithinPartitions
> Trying to understand your point :)
>
>
> Compared to the old version, there may be several improvements:
> 1. it uses Spark built-in operators, which are easier to understand (a rough
> sketch of this follows below);
> 2. during my testing, the upsert performance doubled;
> 3. if possible, we can write data in batches using a DataFrame in the
> future.
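To make item 1 concrete, a minimal sketch of such a reduceByKey-style merge (HoodieRecord's mergeWith() here is a stand-in name, not a real method, and the actual patch may differ):

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import scala.Tuple2;

    // Hedged sketch: union the incoming records with the existing records of the
    // affected files, group by record key, and keep the merged result. Merging then
    // happens in parallel across shuffle partitions instead of a single-threaded
    // pass per file. The payload must encode ordering (e.g. a precombine field) so
    // that the reduce is insensitive to the order in which the two sides arrive.
    static JavaRDD<HoodieRecord> mergeByKey(JavaRDD<HoodieRecord> incoming,
                                            JavaRDD<HoodieRecord> existing) {
      JavaPairRDD<String, HoodieRecord> keyed = incoming.union(existing)
          .mapToPair(r -> new Tuple2<>(r.getRecordKey(), r));
      return keyed
          .reduceByKey((a, b) -> a.mergeWith(b)) // mergeWith() is a stand-in merge function
          .values();
    }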
>
>
> [1]
> https://github.com/BigDataArtisans/incubator-hudi/blob/new-cow-merge/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
>
>
> Best,
> Lamber-Ken
>
>
>
>
>
>
>
>
>
> At 2020-02-29 01:40:36, "Vinoth Chandar"  wrote:
> >Doesn't this move the problem to tuning spark simply? The
> >ExternalSpillableMap itself was not the issue right, the serialization
> >was. This map is also used on the query side btw, where we need something
> >like that.
> >
> >I took a pass at the code. I think we are shuffling data again for the
> >reduceByKey step in this approach? For MOR, note that this is unnecessary
> >since we simply log the records and there is no merge. This approach might
> >have better parallelism for merging when that's costly.. But ultimately,
> >our write parallelism is limited by the number of affected files right? So
> >it's not clear to me that this would always be a win..
> >
> >On the code itself,
> >
> https://github.com/BigDataArtisans/incubator-hudi/blob/new-cow-merge/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java#L546
> > We cannot collect() the recordRDD at all.. It will OOM the driver .. :)
> >
> >Orthogonally, one thing we could think of is: doing a sort based merge.. i.e.
> >repartitionAndSortWithinPartitions() the input records to the merge handle,
> >and if the file is also sorted on disk (it's not today), then we can do a
> >merge-sort-like algorithm to perform the merge.. We can probably write code
> >to bear one-time sorting costs... This will eliminate the need for memory
> >for merging altogether..
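For illustration, the sort-based idea might look roughly like this with Spark's Java API (the partitioner, the "fileId#recordKey" key encoding, and names like taggedRecords and numWriteTasks are assumptions, just to show the shape):

    import org.apache.spark.Partitioner;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import scala.Tuple2;

    // Route every record of a given file to the same task, regardless of record key.
    class FileIdPartitioner extends Partitioner {
      private final int parts;
      FileIdPartitioner(int parts) { this.parts = parts; }
      @Override public int numPartitions() { return parts; }
      @Override public int getPartition(Object compositeKey) {
        String fileId = ((String) compositeKey).split("#", 2)[0];
        return Math.floorMod(fileId.hashCode(), parts);
      }
    }

    // Key = "fileId#recordKey": partitioning uses only the fileId prefix, while the
    // within-partition sort orders records by key. Each write task can then walk its
    // (already key-sorted) base file and the sorted incoming records in lockstep,
    // like the merge step of merge sort, with no spillable map holding either side.
    static JavaPairRDD<String, HoodieRecord> sortForMerge(JavaRDD<HoodieRecord> taggedRecords,
                                                          int numWriteTasks) {
      return taggedRecords
          .mapToPair(r -> new Tuple2<>(r.getCurrentLocation().getFileId() + "#" + r.getRecordKey(), r))
          .repartitionAndSortWithinPartitions(new FileIdPartitioner(numWriteTasks));
    }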
> >
> >On Wed, Feb 26, 2020 at 10:11 PM lamberken  wrote:
> >
> >>
> >>
> >> hi, vinoth
> >>
> >>
> >> > What do you mean by spark built in operators
> >> We may not be able to depend on ExternalSpillableMap anymore when
> >> upserting to a cow table.
> >>
> >>
> >> > Are you suggesting that we perform the merging in sql
> >> No, just use Spark built-in operators like mapToPair, reduceByKey,
> >> etc.
> >>
> >>
> >> Details have been described in this article [1]. I have also finished a
> >> draft implementation and test, which mainly modifies the
> >> HoodieWriteClient#upsertRecordsInternal method.
> >>
> >>
> >> [1]
> >>
> https://docs.google.com/document/d/1-EHHfemtwtX2rSySaPMjeOAUkg5xfqJCKLAETZHa7Qw/edit?usp=sharing
> >> [2]
> >>
> https://github.com/BigDataArtisans/incubator-hudi/blob/new-cow-merge/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
> >>
> >>
> >>
> >> At 2020-02-27 13:45:57, "Vinoth Chandar"  wrote:
> >> >Hi lamber-ken,
> >> >
> >> >Thanks for this. I am not quite following the proposal. What do you mean
> >> >by spark built-in operators? Don't we use the RDD-based spark operations?
> >> >
> >> >Are you suggesting that we perform the merging in sql? Not following.
> >> >Please clarify.
> >> >
> >> >On Wed, 

Re: Hudi 0.5.0 -> Hive JDBC call fails

2020-03-02 Thread Vinoth Chandar
Hi Selva,

See if this helps.
https://lists.apache.org/thread.html/e1fd539ac438276dd7feb2bc813bf85f84a95f7f25b638488eb2e110%40%3Cdev.hudi.apache.org%3E

It's a long thread, but you can probably skim to the last few conversations
around Hive 1.x.

Thanks
Vinoth

On Sun, Mar 1, 2020 at 5:26 PM selvaraj periyasamy <
selvaraj.periyasamy1...@gmail.com> wrote:

> Thanks Vinoth. We do have a plan to move to Hive 2.x in the near future.
> Can I get any info on the workarounds for Hive 1.x versions?
>
> Thanks,
> Selva
>
> On Sun, Mar 1, 2020 at 3:19 PM Vinoth Chandar  wrote:
>
> > We have dropped support for Hive 1.x a while back. Would you be able to
> > move to Hive 2.x?
> >
> > IIRC there were some workarounds discussed on this thread before. But,
> > given the push towards Hive 3.x, it's good to be on 2.x at least ..
> > Let me know and we can go from there :)
> >
> > On Sun, Mar 1, 2020 at 1:09 PM selvaraj periyasamy <
> > selvaraj.periyasamy1...@gmail.com> wrote:
> >
> > > I am using Hudi 0.5.0 and writing using the Spark writer.
> > >
> > > My spark version is 2.3.0
> > > Scala version 2.11.8
> > > Hive version 1.2.2
> > >
> > > The write succeeds but the Hive call is failing. Checking some Google
> > > references, it seems the Hive client is a higher version than the server.
> > > Since Hudi is built on Hive 2.3.1, is there a way to use 1.2.2?
> > >
> > > 2020-03-01 12:16:50 WARN  HoodieSparkSqlWriter$:110 - hoodie dataset at
> > > hdfs://localhost:9000/projects/cdp/data/attunity_poc/attunity_rep_base
> > > already exists. Deleting existing data & overwriting with new data.
> > > [Stage 111:>
> > > 2020-03-01 12:16:51 ERROR HiveConnection:697 - Error opening session
> > > org.apache.thrift.TApplicationException: Required field
> 'client_protocol'
> > > is unset! Struct:TOpenSessionReq(client_protocol:null,
> > >
> > >
> >
> configuration:{set:hiveconf:hive.server2.thrift.resultset.default.fetch.size=1000,
> > > use:database=default})
> > > at
> > >
> > >
> >
> org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
> > > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
> > > at
> > >
> > >
> >
> org.apache.hudi.org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_OpenSession(TCLIService.java:168)
> > > at
> > >
> > >
> >
> org.apache.hudi.org.apache.hive.service.rpc.thrift.TCLIService$Client.OpenSession(TCLIService.java:155)
> > >
> > >
> > > Thanks,
> > > Selva
> > >
> >
>


Re: Contributor permission application

2020-03-02 Thread Vinoth Chandar
Welcome! Added you to JIRA perms

On Mon, Mar 2, 2020 at 4:51 AM 1101300123  wrote:

> Hi,
>
> I want to contribute to Apache Hudi. Would you please give me the
> contributor permission? My JIRA ID is yutaochina.
>
>
>
> 1101300123
> hdxg1101300...@163.com
> (Signature customized by NetEase Mail Master)


Contributor permission application

2020-03-02 Thread 1101300123
Hi,

I want to contribute to Apache Hudi. Would you please give me the contributor 
permission? My JIRA ID is yutaochina.



1101300123
hdxg1101300...@163.com
(Signature customized by NetEase Mail Master)