Re: [ANNOUNCE] Welcome Daniel Chen as Samza Committer

2021-09-20 Thread Hai Lu
Congrats!!!

On Fri, Sep 17, 2021 at 2:06 PM Sanil Jain  wrote:

> Congrats Daniel !
>
> On Fri, 17 Sept 2021 at 11:40, Jagadish Venkatraman <
> jagadish1...@gmail.com>
> wrote:
>
> > Congrats Daniel on this well deserved recognition.
> >
> > Look forward to more contributions!
> >
> >
> > On Fri, Sep 17, 2021 at 11:25 AM Yi Pan  wrote:
> >
> > > Congrats, Daniel, well deserved!!!
> > >
> > > -Yi
> > >
> > > On Fri, Sep 17, 2021 at 11:23 AM Xinyu Liu 
> > wrote:
> > >
> > > > Hi, all,
> > > >
> > > > I am glad to announce that Daniel Chen has officially accepted our
> > > > invitation and become an Apache Samza Committer now.
> > > >
> > > > Daniel has contributed to many areas of Samza, from his early work on
> > > > Eventhub connector, to recently state restore and checkpointing
> > > > improvements. Daniel also contributed tremendously to integrate
> Apache
> > > Beam
> > > > Python API on top of Samza. As an active member in Samza, he has
> > > > participated frequently in the design, code reviews and mailing list
> > > > discussions. He has also contributed to Samza tutorials, website,
> > > releases
> > > > and bug fixes.
> > > >
> > > > Considering his contributions, the Samza PMC trusts Daniel with the
> > > > responsibilities of a Samza Committer.
> > > >
> > > > Please join me to give him a warm welcome!
> > > >
> > > > Xinyu Liu
> > > > on behalf of the Apache Samza PMC
> > > >
> > >
> >
> >
> > --
> > -- Jagadish
> >
>


Apache Samza 1.3.1 is released

2020-02-20 Thread Hai Lu
Documentation and Blog are published. See the announcement from samza
website

and
apache blog 

Thanks,
Hai


[RESULT] [VOTE] Apache Samza 1.3.1 RC0

2020-02-20 Thread Hai Lu
The vote has passed. Thanks!

+1 (binding) x 4: Yi Pan, Bharath, Prateek Maheshwari, Jagadish

-- Forwarded message -
From: Jagadish Venkatraman 
Date: Wed, Feb 19, 2020 at 8:32 PM
Subject: Re: [VOTE] Apache Samza 1.3.1 RC0
To: dev@samza.apache.org 


+1 (binding)

On Wednesday, February 19, 2020, Prateek Maheshwari 
wrote:

> Integration tests and check-all passed successfully. +1 (binding) from me.
>
> Thanks,
> Prateek
>
> On Tue, Feb 18, 2020 at 12:47 PM Bharath Kumara Subramanian <
> codin.mart...@gmail.com> wrote:
>
> > Ran check-all and integration tests passed with one exception.
> > The following test was flaky and passed on the second attempt for a
> > specific combination (Scala - 2.12, YARN - 2.7.1 & JDK 8)
> >
> > > Task :samza-test_2.12:test
> >
> >
> > testRepartitionedSessionWindowCounter FAILED
> >
> > java.lang.AssertionError: expected:<0> but was:<2>
> >
> > at org.junit.Assert.fail(Assert.java:88)
> >
> > at org.junit.Assert.failNotEquals(Assert.java:834)
> >
> > at org.junit.Assert.assertEquals(Assert.java:645)
> >
> > at org.junit.Assert.assertEquals(Assert.java:631)
> >
> > at
> >
> > org.apache.samza.test.operator.TestRepartitionWindowApp.
> testRepartitionedSessionWindowCounter(TestRepartitionWindowApp.java:72)
> >
> > We can follow it up with a ticket if we someone else runs into it as
> well.
> >
> > +1 (binding)
> >
> > Thanks,
> > Bharath
> >
> > On Tue, Feb 18, 2020 at 12:12 AM Yi Pan  wrote:
> >
> > > Ran check-all and integration tests successfully.
> > >
> > > +1 (binding)
> > >
> > > On Thu, Feb 13, 2020 at 12:02 PM Hai Lu  wrote:
> > >
> > > > Hi,
> > > >
> > > > This is a call for a vote on a release of Apache Samza 1.3.1 to
> redress
> > > > certain issues found in 1.3.0
> > > >
> > > > The release candidate can be downloaded from here:
> > > > http://home.apache.org/~lhaiesp/samza-1.3.1-rc0/
> > > >
> > > > The release candidate is signed with pgp key 0x256F8FA2, which can
be
> > > found
> > > > here:
> > > >
> > > >
> > >
> > https://keyserver.ubuntu.com/pks/lookup?search=0x256F8FA2;
> fingerprint=on=index
> > > > or to directly see the public key here:
> > > >
> > > >
> > >
> > https://keyserver.ubuntu.com/pks/lookup?op=get=
> 0x9ebc0889d43fae16dd0d8f5ba2f50cf4256f8fa2
> > > >
> > > > The git tag is release-1.3.1-rc0 and signed with the same pgp key
> > above:
> > > >
> > > >
> > >
> > https://gitbox.apache.org/repos/asf?p=samza.git;a=commit;h=
> 7b849c047827587dec55ac169f41aac7321ce1cb
> > > >
> > > > Test binaries have been published to Maven's staging repository, and
> > are
> > > > available here:
> > > > https://repository.apache.org/content/repositories/
> orgapachesamza-1074
> > > >
> > > > The vote will be open for 128 hours (ending at 8:00 PM PST Tuesday,
> > > > 2/18/2020).
> > > >
> > > > Please download the release candidate, check the hashes/signature,
> > build
> > > it
> > > > and test it, and then please vote:
> > > >
> > > > [ ] +1 approve
> > > >
> > > > [ ] +0 no opinion
> > > >
> > > > [ ] -1 disapprove (and reason why)
> > > >
> > > > I ran check-all.sh and integration tests (both YARN and standalone).
> > > >
> > > > +1 (non-binding) from my side.
> > > >
> > > > Thanks,
> > > > Hai
> > > >
> > >
> >
>


-- 
Jagadish


[VOTE] Apache Samza 1.3.1 RC0

2020-02-13 Thread Hai Lu
Hi,

This is a call for a vote on a release of Apache Samza 1.3.1 to redress
certain issues found in 1.3.0

The release candidate can be downloaded from here:
http://home.apache.org/~lhaiesp/samza-1.3.1-rc0/

The release candidate is signed with pgp key 0x256F8FA2, which can be found
here:
https://keyserver.ubuntu.com/pks/lookup?search=0x256F8FA2=on=index
or to directly see the public key here:
https://keyserver.ubuntu.com/pks/lookup?op=get=0x9ebc0889d43fae16dd0d8f5ba2f50cf4256f8fa2

The git tag is release-1.3.1-rc0 and signed with the same pgp key above:
https://gitbox.apache.org/repos/asf?p=samza.git;a=commit;h=7b849c047827587dec55ac169f41aac7321ce1cb

Test binaries have been published to Maven's staging repository, and are
available here:
https://repository.apache.org/content/repositories/orgapachesamza-1074

The vote will be open for 128 hours (ending at 8:00 PM PST Tuesday,
2/18/2020).

Please download the release candidate, check the hashes/signature, build it
and test it, and then please vote:

[ ] +1 approve

[ ] +0 no opinion

[ ] -1 disapprove (and reason why)

I ran check-all.sh and integration tests (both YARN and standalone).

+1 (non-binding) from my side.

Thanks,
Hai


[DISCUSS] Samza 1.3.1 release

2020-02-13 Thread Hai Lu
Hi all,

We're going to make a 1.3.1 release to address some critical issues that
were found in 1.3.0

1.3.1 will be based off 1.3.0 but include the following additional commits:

SAMZA-2447: Checkpoint dir removal should only search in valid store dirs
(#1261)
SAMZA-2446: Invoke onCheckpoint only for registered SSPs (#1260)
SAMZA-2431: Fix the checkpoint and changelog topic auto-creation. (#1251)
SAMZA-2434: Fix the coordinator steam creation workflow
SAMZA-2423: Heartbeat failure causes incorrect container shutdown (#1240)
SAMZA-2305: Stream processor should ensure previous container is stopped
during a rebalance (#1213)

I'm going to create the release candidate soon.

Thanks,
Hai


Apache Samza 1.3.0 is released

2019-12-10 Thread Hai Lu
Documentation and Blog are published. See the announcement here


Thanks,
Hai


[RESULT] [VOTE] Apache Samza 1.3.0 RC2

2019-12-05 Thread Hai Lu
Thanks everyone. The vote has passed.

+1 (binding) x 3: Yi Pan, Xinyu Liu, Prateek Maheshwari

-- Forwarded message -
From: Prateek Maheshwari 
Date: Wed, Dec 4, 2019 at 3:25 PM
Subject: Re: [VOTE] Apache Samza 1.3.0 RC2
To: 


+ 1 (binding)

Verified the signatures, built and ran check-all and the integration
tests. All tests passed.

Thanks for co-ordinating the release.

- Prateek

On Mon, Dec 2, 2019 at 10:18 AM Xinyu Liu  wrote:

> + 1 (binding)
>
> Verified the signatures, built and ran the integration tests. All passed.
> There is one flaky test failure during running check-all.sh:
>
>
>
org.apache.samza.table.batching.TestBatchProcessor$TestBatchTriggered.testBatchOperationTriggeredByBatchSize
> FAILED
> java.lang.AssertionError
> at org.junit.Assert.fail(Assert.java:86)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertTrue(Assert.java:52)
> at
>
>
org.apache.samza.table.batching.TestBatchProcessor$TestBatchTriggered.testBatchOperationTriggeredByBatchSize(TestBatchProcessor.java:122)
>
> This shouldn't block the release as the test is flaky. We should either
fix
> or disable this test for the future releases. Create ticket to track:
> https://issues.apache.org/jira/browse/SAMZA-2411
>
> Thanks,
> Xinyu
>
>
>
> On Sun, Dec 1, 2019 at 6:20 PM Yi Pan  wrote:
>
> > +1 (binding), verified the signature, built and local integration tests
> > passed.
> >
> > Thanks!
> >
> > -Yi
> >
> > On Wed, Nov 27, 2019 at 2:49 PM Hai Lu  wrote:
> >
> > > Hi,
> > >
> > > This is a call for a vote on a release of Apache Samza 1.3.0. Thanks
to
> > > everyone who has contributed to this release.
> > >
> > > The release candidate can be downloaded from here:
> > > http://home.apache.org/~lhaiesp/samza-1.3.0-rc2/
> > >
> > > The release candidate is signed with pgp key 0x07678C76, which can be
> > found
> > > here:
> > >
> > >
> >
>
https://keyserver.ubuntu.com/pks/lookup?search=0x07678C76=on=index
> > > or to directly see the public key here:
> > >
> > >
> >
>
https://keyserver.ubuntu.com/pks/lookup?op=get=0x1513eaedf69d7ca77ff283b534ea3ca507678c76
> > >
> > > The git tag is release-1.3.0-rc2 and signed with the same pgp key
> above:
> > >
> > >
> >
>
https://gitbox.apache.org/repos/asf?p=samza.git;a=commit;h=573ef951dd9d96d9d547db86bbc8023557714f47
> > >
> > > Test binaries have been published to Maven's staging repository, and
> are
> > > available here:
> > > https://repository.apache.org/content/repositories/orgapachesamza-1073
> > >
> > > The vote will be open for 171 hours (ending at 6:00 PM PST Wednesday,
> > > 12/4/2019).
> > >
> > > Please download the release candidate, check the hashes/signature,
> build
> > it
> > > and test it, and then please vote:
> > >
> > > [ ] +1 approve
> > >
> > > [ ] +0 no opinion
> > >
> > > [ ] -1 disapprove (and reason why)
> > >
> > > I ran check-all.sh and integration tests (both YARN and standalone).
> > >
> > > +1 (non-binding) from my side.
> > >
> > > Thanks,
> > > Hai
> > >
> >
>


Re: [VOTE] Apache Samza 1.3.0 RC2

2019-12-04 Thread Hai Lu
Thanks everyone. The vote passed.


On Wed, Dec 4, 2019 at 3:25 PM Prateek Maheshwari 
wrote:

> + 1 (binding)
>
> Verified the signatures, built and ran check-all and the integration
> tests. All tests passed.
>
> Thanks for co-ordinating the release.
>
> - Prateek
>
> On Mon, Dec 2, 2019 at 10:18 AM Xinyu Liu  wrote:
>
> > + 1 (binding)
> >
> > Verified the signatures, built and ran the integration tests. All passed.
> > There is one flaky test failure during running check-all.sh:
> >
> >
> >
> org.apache.samza.table.batching.TestBatchProcessor$TestBatchTriggered.testBatchOperationTriggeredByBatchSize
> > FAILED
> > java.lang.AssertionError
> > at org.junit.Assert.fail(Assert.java:86)
> > at org.junit.Assert.assertTrue(Assert.java:41)
> > at org.junit.Assert.assertTrue(Assert.java:52)
> > at
> >
> >
> org.apache.samza.table.batching.TestBatchProcessor$TestBatchTriggered.testBatchOperationTriggeredByBatchSize(TestBatchProcessor.java:122)
> >
> > This shouldn't block the release as the test is flaky. We should either
> fix
> > or disable this test for the future releases. Create ticket to track:
> > https://issues.apache.org/jira/browse/SAMZA-2411
> >
> > Thanks,
> > Xinyu
> >
> >
> >
> > On Sun, Dec 1, 2019 at 6:20 PM Yi Pan  wrote:
> >
> > > +1 (binding), verified the signature, built and local integration tests
> > > passed.
> > >
> > > Thanks!
> > >
> > > -Yi
> > >
> > > On Wed, Nov 27, 2019 at 2:49 PM Hai Lu  wrote:
> > >
> > > > Hi,
> > > >
> > > > This is a call for a vote on a release of Apache Samza 1.3.0. Thanks
> to
> > > > everyone who has contributed to this release.
> > > >
> > > > The release candidate can be downloaded from here:
> > > > http://home.apache.org/~lhaiesp/samza-1.3.0-rc2/
> > > >
> > > > The release candidate is signed with pgp key 0x07678C76, which can be
> > > found
> > > > here:
> > > >
> > > >
> > >
> >
> https://keyserver.ubuntu.com/pks/lookup?search=0x07678C76=on=index
> > > > or to directly see the public key here:
> > > >
> > > >
> > >
> >
> https://keyserver.ubuntu.com/pks/lookup?op=get=0x1513eaedf69d7ca77ff283b534ea3ca507678c76
> > > >
> > > > The git tag is release-1.3.0-rc2 and signed with the same pgp key
> > above:
> > > >
> > > >
> > >
> >
> https://gitbox.apache.org/repos/asf?p=samza.git;a=commit;h=573ef951dd9d96d9d547db86bbc8023557714f47
> > > >
> > > > Test binaries have been published to Maven's staging repository, and
> > are
> > > > available here:
> > > >
> https://repository.apache.org/content/repositories/orgapachesamza-1073
> > > >
> > > > The vote will be open for 171 hours (ending at 6:00 PM PST Wednesday,
> > > > 12/4/2019).
> > > >
> > > > Please download the release candidate, check the hashes/signature,
> > build
> > > it
> > > > and test it, and then please vote:
> > > >
> > > > [ ] +1 approve
> > > >
> > > > [ ] +0 no opinion
> > > >
> > > > [ ] -1 disapprove (and reason why)
> > > >
> > > > I ran check-all.sh and integration tests (both YARN and standalone).
> > > >
> > > > +1 (non-binding) from my side.
> > > >
> > > > Thanks,
> > > > Hai
> > > >
> > >
> >
>


[CANCEL] [VOTE] Apache Samza 1.3.0 RC1

2019-11-27 Thread Hai Lu
See details below.

-- Forwarded message -
From: Hai Lu 
Date: Wed, Nov 27, 2019 at 2:56 PM
Subject: Re: [VOTE] Apache Samza 1.3.0 RC1
To: 


This is canceled because issues have been detected with the integration
tests.

On Thu, Nov 21, 2019 at 1:47 PM Hai Lu  wrote:

> Hi,
>
> This is a call for a vote on a release of Apache Samza 1.3.0. Thanks to
> everyone who has contributed to this release.
>
> The release candidate can be downloaded from here:
> http://home.apache.org/~lhaiesp/samza-1.3.0-rc1/
>
> The release candidate is signed with pgp key 0x07678C76, which can be
> found here:
>
> https://keyserver.ubuntu.com/pks/lookup?search=0x07678C76=on=index
> or to directly see the public key here:
>
> https://keyserver.ubuntu.com/pks/lookup?op=get=0x1513eaedf69d7ca77ff283b534ea3ca507678c76
>
> The git tag is release-1.3.0-rc1 and signed with the same pgp key above:
>
> https://gitbox.apache.org/repos/asf?p=samza.git;a=commit;h=44c791fefd74f20470d2669e3fea46d6d3780a4a
>
> Test binaries have been published to Maven's staging repository, and are
> available here:
> https://repository.apache.org/content/repositories/orgapachesamza-1072/
>
> The vote will be open for 100 hours (ending at 6:00 PM PST Monday,
> 11/25/2019).
>
> Please download the release candidate, check the hashes/signature, build
> it and test it, and then please vote:
>
> [ ] +1 approve
>
> [ ] +0 no opinion
>
> [ ] -1 disapprove (and reason why)
>
> I ran check-all.sh and integration tests.
>
> +1 (non-binding) from my side.
>
> Thanks,
> Hai
>


Re: [VOTE] Apache Samza 1.3.0 RC1

2019-11-27 Thread Hai Lu
This is canceled because issues have been detected with the integration
tests.

On Thu, Nov 21, 2019 at 1:47 PM Hai Lu  wrote:

> Hi,
>
> This is a call for a vote on a release of Apache Samza 1.3.0. Thanks to
> everyone who has contributed to this release.
>
> The release candidate can be downloaded from here:
> http://home.apache.org/~lhaiesp/samza-1.3.0-rc1/
>
> The release candidate is signed with pgp key 0x07678C76, which can be
> found here:
>
> https://keyserver.ubuntu.com/pks/lookup?search=0x07678C76=on=index
> or to directly see the public key here:
>
> https://keyserver.ubuntu.com/pks/lookup?op=get=0x1513eaedf69d7ca77ff283b534ea3ca507678c76
>
> The git tag is release-1.3.0-rc1 and signed with the same pgp key above:
>
> https://gitbox.apache.org/repos/asf?p=samza.git;a=commit;h=44c791fefd74f20470d2669e3fea46d6d3780a4a
>
> Test binaries have been published to Maven's staging repository, and are
> available here:
> https://repository.apache.org/content/repositories/orgapachesamza-1072/
>
> The vote will be open for 100 hours (ending at 6:00 PM PST Monday,
> 11/25/2019).
>
> Please download the release candidate, check the hashes/signature, build
> it and test it, and then please vote:
>
> [ ] +1 approve
>
> [ ] +0 no opinion
>
> [ ] -1 disapprove (and reason why)
>
> I ran check-all.sh and integration tests.
>
> +1 (non-binding) from my side.
>
> Thanks,
> Hai
>


[VOTE] Apache Samza 1.3.0 RC2

2019-11-27 Thread Hai Lu
Hi,

This is a call for a vote on a release of Apache Samza 1.3.0. Thanks to
everyone who has contributed to this release.

The release candidate can be downloaded from here:
http://home.apache.org/~lhaiesp/samza-1.3.0-rc2/

The release candidate is signed with pgp key 0x07678C76, which can be found
here:
https://keyserver.ubuntu.com/pks/lookup?search=0x07678C76=on=index
or to directly see the public key here:
https://keyserver.ubuntu.com/pks/lookup?op=get=0x1513eaedf69d7ca77ff283b534ea3ca507678c76

The git tag is release-1.3.0-rc2 and signed with the same pgp key above:
https://gitbox.apache.org/repos/asf?p=samza.git;a=commit;h=573ef951dd9d96d9d547db86bbc8023557714f47

Test binaries have been published to Maven's staging repository, and are
available here:
https://repository.apache.org/content/repositories/orgapachesamza-1073

The vote will be open for 171 hours (ending at 6:00 PM PST Wednesday,
12/4/2019).

Please download the release candidate, check the hashes/signature, build it
and test it, and then please vote:

[ ] +1 approve

[ ] +0 no opinion

[ ] -1 disapprove (and reason why)

I ran check-all.sh and integration tests (both YARN and standalone).

+1 (non-binding) from my side.

Thanks,
Hai


[VOTE] Apache Samza 1.3.0 RC1

2019-11-21 Thread Hai Lu
Hi,

This is a call for a vote on a release of Apache Samza 1.3.0. Thanks to
everyone who has contributed to this release.

The release candidate can be downloaded from here:
http://home.apache.org/~lhaiesp/samza-1.3.0-rc1/

The release candidate is signed with pgp key 0x07678C76, which can be found
here:
https://keyserver.ubuntu.com/pks/lookup?search=0x07678C76=on=index
or to directly see the public key here:
https://keyserver.ubuntu.com/pks/lookup?op=get=0x1513eaedf69d7ca77ff283b534ea3ca507678c76

The git tag is release-1.3.0-rc1 and signed with the same pgp key above:
https://gitbox.apache.org/repos/asf?p=samza.git;a=commit;h=44c791fefd74f20470d2669e3fea46d6d3780a4a

Test binaries have been published to Maven's staging repository, and are
available here:
https://repository.apache.org/content/repositories/orgapachesamza-1072/

The vote will be open for 100 hours (ending at 6:00 PM PST Monday,
11/25/2019).

Please download the release candidate, check the hashes/signature, build it
and test it, and then please vote:

[ ] +1 approve

[ ] +0 no opinion

[ ] -1 disapprove (and reason why)

I ran check-all.sh and integration tests.

+1 (non-binding) from my side.

Thanks,
Hai


Re: [VOTE] Apache Samza 1.3.0 RC0

2019-11-13 Thread Hai Lu
Canceling the RC as there are requests for additional cherry-picks. Please
also help mark the jira ticket as "fixed version: 1.3.0" if need to be
included in this release.

Thanks,
Hai

On Tue, Nov 12, 2019 at 5:12 PM Bharath Kumara Subramanian <
codin.mart...@gmail.com> wrote:

> -1
>
> Unfortunately, we discovered a bug in side inputs for standby containers.
> Here is the ticket that tracks the issue -
> https://issues.apache.org/jira/browse/SAMZA-2382
>
> Can we pull https://github.com/apache/samza/pull/1218 into the release as
> well?
>
> Thanks,
> Bharath
>
> On Tue, Nov 12, 2019 at 3:41 PM Hai Lu  wrote:
>
> > Hi,
> >
> > This is a call for a vote on a release of Apache Samza 1.3.0. Thanks to
> > everyone who has contributed to this release.
> >
> > The release candidate can be downloaded from here:
> > http://home.apache.org/~lhaiesp/samza-1.3.0-rc0/
> >
> > The release candidate is signed with pgp key 0x07678C76, which can be
> found
> > here:
> >
> >
> https://keyserver.ubuntu.com/pks/lookup?search=0x07678C76=on=index
> > or to directly see the public key here:
> >
> >
> https://keyserver.ubuntu.com/pks/lookup?op=get=0x1513eaedf69d7ca77ff283b534ea3ca507678c76
> >
> > The git tag is release-1.3.0-rc0 and signed with the same pgp key above:
> >
> >
> https://gitbox.apache.org/repos/asf?p=samza.git;a=tag;h=aa4bafdfe54919d2dc576c18f542d5fb7066d04b
> >
> > Test binaries have been published to Maven's staging repository, and are
> > available here:
> > https://repository.apache.org/content/repositories/orgapachesamza-1071/
> >
> > The vote will be open for 56 hours (ending at 10:00 PM PST Thursday,
> > 11/14/2019).
> >
> > Please download the release candidate, check the hashes/signature, build
> it
> > and test it, and then please vote:
> >
> > [ ] +1 approve
> >
> > [ ] +0 no opinion
> >
> > [ ] -1 disapprove (and reason why)
> >
> > I ran check-all.sh and integration tests.
> >
> > +1 (non-binding) from my side.
> >
> > Thanks,
> > Hai
> >
>


[VOTE] Apache Samza 1.3.0 RC0

2019-11-12 Thread Hai Lu
Hi,

This is a call for a vote on a release of Apache Samza 1.3.0. Thanks to
everyone who has contributed to this release.

The release candidate can be downloaded from here:
http://home.apache.org/~lhaiesp/samza-1.3.0-rc0/

The release candidate is signed with pgp key 0x07678C76, which can be found
here:
https://keyserver.ubuntu.com/pks/lookup?search=0x07678C76=on=index
or to directly see the public key here:
https://keyserver.ubuntu.com/pks/lookup?op=get=0x1513eaedf69d7ca77ff283b534ea3ca507678c76

The git tag is release-1.3.0-rc0 and signed with the same pgp key above:
https://gitbox.apache.org/repos/asf?p=samza.git;a=tag;h=aa4bafdfe54919d2dc576c18f542d5fb7066d04b

Test binaries have been published to Maven's staging repository, and are
available here:
https://repository.apache.org/content/repositories/orgapachesamza-1071/

The vote will be open for 56 hours (ending at 10:00 PM PST Thursday,
11/14/2019).

Please download the release candidate, check the hashes/signature, build it
and test it, and then please vote:

[ ] +1 approve

[ ] +0 no opinion

[ ] -1 disapprove (and reason why)

I ran check-all.sh and integration tests.

+1 (non-binding) from my side.

Thanks,
Hai


[DISCUSS] Samza 1.3 release

2019-11-06 Thread Hai Lu
Hi all,

It's been some time since our last release and we have accumulated some
good features/improvements which call for a new 1.3 release.

Want to kick off the discussion in the open source forum while we have
already tested many of these changes internally at LinkedIn.

As a quick highlight of the release, it includes many improvements on Samza
SQL, table API, etc. As well as a critical at least once bug fix for
aggregation.

A comprehensive list of changes (jira tickets) can be found here:
https://issues.apache.org/jira/browse/SAMZA-2356?jql=project%20%3D%20%22SAMZA%22%20and%20fixVersion%20in%20(1.3)

The list was compiled based on the committed log on the already created 1.3.0
branch 

*To everyone, any potential changes/patches that need to be cherry-picked
into 1.3.0 branch, please remember to put 1.3 as the "fix version" in the
jira ticket.*

I will move forward to generate the binary and target a release vote later
this week or early next week, if everyone agrees on the plan for release.

Thanks,
Hai


Re: [ANNOUNCE] Please welcome Boris Shkolnik to the Samza PMC

2019-06-08 Thread Hai Lu
Congratulations, Boris!

On Fri, Jun 7, 2019 at 6:13 PM Aditya  wrote:

> Congrats Boris!
>
> > On Jun 7, 2019, at 4:58 PM, Weiqing Yang 
> wrote:
> >
> > Congrats, Boris!
> >
> > On Fri, Jun 7, 2019 at 4:50 PM santhosh venkat <
> santhoshvenkat1...@gmail.com>
> > wrote:
> >
> >> Congratulations boris! Very well deserved.
> >>
> >> On Fri, Jun 7, 2019 at 3:41 PM Daniel Nishimura 
> >> wrote:
> >>
> >>> Congrats!
> >>>
>  On Fri, Jun 7, 2019 at 3:35 PM Ignacio Solis  wrote:
> 
>  Congrats Boris!
> 
>  On Fri, Jun 7, 2019 at 3:20 PM Bharath Kumara Subramanian <
>  codin.mart...@gmail.com> wrote:
> 
> > Congratulations Boris!
> >
> > On Fri, Jun 7, 2019 at 3:19 PM Jagadish Venkatraman <
> > jagadish1...@gmail.com>
> > wrote:
> >
> >> Congratulations Boris!
> >>
> >> On Fri, Jun 7, 2019 at 3:15 PM Xinyu Liu 
>  wrote:
> >>
> >>> Congrats, Boris!
> >>>
> >>> Xinyu
> >>>
> >>> On Fri, Jun 7, 2019 at 3:13 PM Jakob Homan 
>  wrote:
> >>>
>  Howdy all-
>    I'm very pleased to announce that the Samza PMC has voted
> >>> Boris
>  Shkolnik to be a Project Management Committee (PMC) Member.
> >> The
>  PMC
>  is responsible for the overall health of a project andl for
> >>> voting
>  in
>  new committers and PMC members, as well as VOTEing on releases.
>  Over
>  the past two years, Boris has been a valuable committer on the
>  project.
> 
>  Congrats Boris!
> 
>  Thanks,
> 
>  Jakob
>  on behalf of the Samza PMC
> 
> >>>
> >>
> >>
> >> --
> >> Jagadish V,
> >> Graduate Student,
> >> Department of Computer Science,
> >> Stanford University
> >>
> >
> 
> 
>  --
>  Nacho - Ignacio Solis - iso...@igso.net
> 
> >>>
> >>
>


Re: REMINDER. [VOTE] Apache Samza 1.2.0 RC4

2019-06-04 Thread Hai Lu
+1 (non-binding)

Verified build and test on Linux box. On mac the test is failing but seems
like flakiness not real failure.

Thanks,
Hai

On Tue, Jun 4, 2019 at 1:55 PM santhosh venkat 
wrote:

> +1(non-binding)
>
> 1. ./bin/check-all.sh succeeded
> 2. ./bin/integration-tests.sh succeeded
> 3. Expanded samza-tools and followed the tutorial steps for standalone SQL
> examples Succeeded.
> 4. Verified all sha1 hash code and asc signatures successfully
>
> Thanks,
>
>
> On Tue, Jun 4, 2019 at 1:26 PM Xinyu Liu  wrote:
>
> > +1 (binding).
> >
> > run check-all.sh and the build passed.
> >
> > Having trouble running the integration tests in both linux and mac,
> > possibly due to my local machine env.
> >
> > Thanks,
> > Xinyu
> >
> > On Mon, Jun 3, 2019 at 11:00 AM Daniel Nishimura 
> > wrote:
> >
> > > check-all.sh and integration tests passed. +1 from me.
> > >
> > > Just a side note, the link in the original email is a broken link. The
> > link
> > > to the RC archive is: http://home.apache.org/~boryas/samza-1.2.0-rc4
> > >
> > > On Sun, Jun 2, 2019 at 5:00 PM Boris Shkolnik 
> wrote:
> > >
> > > > Hi,
> > > >
> > > > This is a call for a vote on a release of Apache Samza 1.2.0. Thanks
> to
> > > > everyone who has contributed to this release.
> > > >
> > > >
> > > > The release candidate can be downloaded from here:
> > > > http://home.apache.org/~boryas/samza-1.2.0-rc
> > > > 4
> > > >
> > > > (this release has a fix for standalone integration test)
> > > >
> > > > The release candidate is signed with pgp key 0x7D74D0CD5B5EB041,
> which
> > > can
> > > > be found
> > > >
> > http://keyserver.ubuntu.com/pks/lookup?op=get=0x7d74d0cd5b5eb041
> > > > <
> > http://keyserver.ubuntu.com/pks/lookup?op=get=0xF8B95961A401BF0F
> > > >
> > > > The git tag is release-1.2.0-rc4 and signed with the same pgp key:
> > > >
> > > >
> > >
> >
> https://gitbox.apache.org/repos/asf?p=samza.git;a=tag;h=refs/tags/release-1.2.0-rc
> > > > <
> > > >
> > >
> >
> https://gitbox.apache.org/repos/asf?p=samza.git;a=tag;h=refs/tags/release-1.1.0-rc1
> > > > >
> > > > 4
> > > >
> > > > Test binaries have been published to Maven's staging repository, and
> > are
> > > > available here:
> > > >
> https://repository.apache.org/content/repositories/orgapachesamza-106
> > > > <
> > > >
> > >
> >
> https://repository.apache.org/content/repositories/orgapachesamza-1065/org/
> > > > >
> > > > 9
> > > >
> > > > The vote will be open until 06:00 PM PST Monday, 06/03/2019.
> > > >
> > > >
> > > > Please download the release candidate, check the hashes/signature,
> > build
> > > it
> > > > and test it, and then please vote:
> > > >
> > > > [ ] +1 approve
> > > >
> > > > [ ] +0 no opinion
> > > >
> > > > [ ] -1 disapprove (and reason why)
> > > >
> > > > I ran check-all.sh and integration tests.
> > > >
> > > > +1 from my side.
> > > >
> > > > Thanks
> > > >
> > >
> >
>


Re: Samza 1.1.0 on AWS EMR (emr - 5.13.0, amazon 2.8.3, zookeeper 3.4.10)

2019-04-18 Thread Hai Lu
+samza dev

ApplicationRunnerMain should be there in the samza-core module. Are you
seeing the samza-core jar in your lib folder? Make sure the scala version
also match (2.10 vs. 2.11)

Are you upgrading from 0.14 to 1.1 or from 1.0 to 1.1?

Thanks,
Hai


On Wed, Apr 17, 2019 at 4:28 PM Majd F. Sakr  wrote:

> Hello Hai,
>
> I hope you are doing well.  I'm writing you from Carnegie Mellon
> University again.
>
> One of my TAs is trying to upgrade our project to use the latest Samza
> distribution but is failing.  He describes the situation here:
>
> https://stackoverflow.com/questions/55737123/samza-1-1-0-run-app-sh-does-not-work-during-deployment-of-hello-samza
>
> Can someone on your team please help?
>
> Many thanks,
>
> Majd
>
>


Re: Review Request 52570: SAMZA-1025: documentation for hdfs system consumer

2017-01-27 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52570/
---

(Updated Jan. 28, 2017, 6:15 a.m.)


Review request for samza.


Bugs: SAMZA-1025
https://issues.apache.org/jira/browse/SAMZA-1025


Repository: samza


Description
---

documentation for hdfs system consumer


Diffs (updated)
-

  docs/learn/documentation/versioned/hdfs/consumer.md PRE-CREATION 
  docs/learn/documentation/versioned/hdfs/producer.md 
b0e936f5b0a9c945ea7f02bfc2536ef50f017bf6 
  docs/learn/documentation/versioned/index.html 
d0b14ece94341e2cb937cf32db480e69f93303c2 
  docs/learn/documentation/versioned/jobs/configuration-table.html 
ba5ebbc54b5c64f82f35ed781dad7023a8f920e1 

Diff: https://reviews.apache.org/r/52570/diff/


Testing
---

N/A


Thanks,

Hai Lu



Re: Review Request 52570: SAMZA-1025: documentation for hdfs system consumer

2017-01-27 Thread Hai Lu


> On Jan. 27, 2017, 7:03 a.m., Jagadish Venkatraman wrote:
> > This is looking pretty good. Thank you for the effort in writing the docs!

Thank you for taking the time to review it. Really appreciate it:)


> On Jan. 27, 2017, 7:03 a.m., Jagadish Venkatraman wrote:
> > docs/learn/documentation/versioned/hdfs/consumer.md, line 30
> > <https://reviews.apache.org/r/52570/diff/4/?file=1617159#file1617159line30>
> >
> > `to process these partitions` appears twice? Copy-paste error?

Oops, sorry for such a reckless mistake.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52570/#review163234
-------


On Jan. 27, 2017, 5:48 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52570/
> ---
> 
> (Updated Jan. 27, 2017, 5:48 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Bugs: SAMZA-1025
> https://issues.apache.org/jira/browse/SAMZA-1025
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> documentation for hdfs system consumer
> 
> 
> Diffs
> -
> 
>   docs/learn/documentation/versioned/hdfs/consumer.md PRE-CREATION 
>   docs/learn/documentation/versioned/hdfs/producer.md 
> b0e936f5b0a9c945ea7f02bfc2536ef50f017bf6 
>   docs/learn/documentation/versioned/index.html 
> d0b14ece94341e2cb937cf32db480e69f93303c2 
>   docs/learn/documentation/versioned/jobs/configuration-table.html 
> ba5ebbc54b5c64f82f35ed781dad7023a8f920e1 
> 
> Diff: https://reviews.apache.org/r/52570/diff/
> 
> 
> Testing
> ---
> 
> N/A
> 
> 
> Thanks,
> 
> Hai Lu
> 
>



Re: Review Request 52570: SAMZA-1025: documentation for hdfs system consumer

2017-01-27 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52570/
---

(Updated Jan. 27, 2017, 5:48 p.m.)


Review request for samza.


Bugs: SAMZA-1025
https://issues.apache.org/jira/browse/SAMZA-1025


Repository: samza


Description
---

documentation for hdfs system consumer


Diffs (updated)
-

  docs/learn/documentation/versioned/hdfs/consumer.md PRE-CREATION 
  docs/learn/documentation/versioned/hdfs/producer.md 
b0e936f5b0a9c945ea7f02bfc2536ef50f017bf6 
  docs/learn/documentation/versioned/index.html 
d0b14ece94341e2cb937cf32db480e69f93303c2 
  docs/learn/documentation/versioned/jobs/configuration-table.html 
ba5ebbc54b5c64f82f35ed781dad7023a8f920e1 

Diff: https://reviews.apache.org/r/52570/diff/


Testing
---

N/A


Thanks,

Hai Lu



Re: Review Request 52570: SAMZA-1025: documentation for hdfs system consumer

2017-01-26 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52570/
---

(Updated Jan. 27, 2017, 12:03 a.m.)


Review request for samza.


Bugs: SAMZA-1025
https://issues.apache.org/jira/browse/SAMZA-1025


Repository: samza


Description
---

documentation for hdfs system consumer


Diffs (updated)
-

  docs/learn/documentation/versioned/hdfs/consumer.md PRE-CREATION 
  docs/learn/documentation/versioned/hdfs/producer.md 
b0e936f5b0a9c945ea7f02bfc2536ef50f017bf6 
  docs/learn/documentation/versioned/index.html 
d0b14ece94341e2cb937cf32db480e69f93303c2 
  docs/learn/documentation/versioned/jobs/configuration-table.html 
ba5ebbc54b5c64f82f35ed781dad7023a8f920e1 

Diff: https://reviews.apache.org/r/52570/diff/


Testing
---

N/A


Thanks,

Hai Lu



Re: Review Request 52570: SAMZA-1025: documentation for hdfs system consumer

2017-01-26 Thread Hai Lu


> On Jan. 26, 2017, 10:49 p.m., Jagadish Venkatraman wrote:
> > docs/learn/documentation/versioned/hdfs/consumer.md, line 92
> > <https://reviews.apache.org/r/52570/diff/3/?file=1616638#file1616638line92>
> >
> > nit: Use capitalizations consistently
> > 
> > 1. Is `id` of any significance here?

Yes. it has to be exactly "[id]"


> On Jan. 26, 2017, 10:49 p.m., Jagadish Venkatraman wrote:
> > docs/learn/documentation/versioned/jobs/configuration-table.html, line 1832
> > <https://reviews.apache.org/r/52570/diff/3/?file=1616641#file1616641line1832>
> >
> > Do we allow any other `partitioner` other than the defaultPartitioner? 
> > If not, it feels slightly unconventional to have `default` in the property 
> > name?

This was discussed in the code review for implementation. And yes, we do allow 
other type of partitioner potentially in the future. That's why I was suggested 
to put "defaultPartitioner" here.


> On Jan. 26, 2017, 10:49 p.m., Jagadish Venkatraman wrote:
> > docs/learn/documentation/versioned/jobs/configuration-table.html, line 1839
> > <https://reviews.apache.org/r/52570/diff/3/?file=1616641#file1616641line1839>
> >
> > Is the group identifier the partition name? 
> > Is `id` a reserved term? 
> > Do we always expect `[id]`?

yes, it's a reserved term. updated the doc.


> On Jan. 26, 2017, 10:49 p.m., Jagadish Venkatraman wrote:
> > docs/learn/documentation/versioned/jobs/configuration-table.html, line 1847
> > <https://reviews.apache.org/r/52570/diff/3/?file=1616641#file1616641line1847>
> >
> > The other minor observation is that the camel case naming scheme is 
> > slightly inconsistent with what we have for our public configs.
> > 
> > For instance, job.coordinator.
> > monitor-partition-change follows a different convention. I'd really 
> > prefer consistency in our config scheme. 
> > 
> > Navina or Prateek thoughts?

I'm not sure what kind of discussion was going on there. But when I implemented 
the codes, I was explicitly told to adopt such a style (camel case).


> On Jan. 26, 2017, 10:49 p.m., Jagadish Venkatraman wrote:
> > docs/learn/documentation/versioned/jobs/configuration-table.html, line 1842
> > <https://reviews.apache.org/r/52570/diff/3/?file=1616641#file1616641line1842>
> >
> > Is this the name of the class? I'm still not clear how this is used?

It's literally "avro", or "plain", or "json". Though the last two are not 
supported now. No, they are not class name.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52570/#review163177
---


On Jan. 26, 2017, 6:47 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52570/
> ---
> 
> (Updated Jan. 26, 2017, 6:47 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Bugs: SAMZA-1025
> https://issues.apache.org/jira/browse/SAMZA-1025
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> documentation for hdfs system consumer
> 
> 
> Diffs
> -
> 
>   docs/learn/documentation/versioned/hdfs/consumer.md PRE-CREATION 
>   docs/learn/documentation/versioned/hdfs/producer.md 
> b0e936f5b0a9c945ea7f02bfc2536ef50f017bf6 
>   docs/learn/documentation/versioned/index.html 
> d0b14ece94341e2cb937cf32db480e69f93303c2 
>   docs/learn/documentation/versioned/jobs/configuration-table.html 
> ba5ebbc54b5c64f82f35ed781dad7023a8f920e1 
> 
> Diff: https://reviews.apache.org/r/52570/diff/
> 
> 
> Testing
> ---
> 
> N/A
> 
> 
> Thanks,
> 
> Hai Lu
> 
>



Re: Review Request 52570: SAMZA-1025: documentation for hdfs system consumer

2017-01-26 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52570/
---

(Updated Jan. 26, 2017, 6:47 p.m.)


Review request for samza.


Bugs: SAMZA-1025
https://issues.apache.org/jira/browse/SAMZA-1025


Repository: samza


Description
---

documentation for hdfs system consumer


Diffs (updated)
-

  docs/learn/documentation/versioned/hdfs/consumer.md PRE-CREATION 
  docs/learn/documentation/versioned/hdfs/producer.md 
b0e936f5b0a9c945ea7f02bfc2536ef50f017bf6 
  docs/learn/documentation/versioned/index.html 
d0b14ece94341e2cb937cf32db480e69f93303c2 
  docs/learn/documentation/versioned/jobs/configuration-table.html 
ba5ebbc54b5c64f82f35ed781dad7023a8f920e1 

Diff: https://reviews.apache.org/r/52570/diff/


Testing
---

N/A


Thanks,

Hai Lu



Re: Review Request 52570: SAMZA-1025: documentation for hdfs system consumer

2017-01-26 Thread Hai Lu


> On Jan. 25, 2017, 10:50 p.m., Navina Ramesh wrote:
> > docs/learn/documentation/versioned/hdfs/consumer.md, line 26
> > <https://reviews.apache.org/r/52570/diff/2/?file=1613256#file1613256line26>
> >
> > Can you include the diagram from your design document?  Or something 
> > similar to elaborate how the setup should look like?

The diagram was mostly for the situation at LinkedIn where we have separte yarn 
clusters - one for Samza, one for Hadoop. "Your job needs to run on the same 
YARN cluster which hosts the HDFS you want to consume from."  Is this statement 
not clear enough? What suggestion do you have in terms of the wording?


> On Jan. 25, 2017, 10:50 p.m., Navina Ramesh wrote:
> > docs/learn/documentation/versioned/hdfs/consumer.md, line 42
> > <https://reviews.apache.org/r/52570/diff/2/?file=1613256#file1613256line42>
> >
> > Can you add a snippet of the interface here as well? It is easier for 
> > the user to skim through.

Linked to the java doc.


> On Jan. 25, 2017, 10:50 p.m., Navina Ramesh wrote:
> > docs/learn/documentation/versioned/hdfs/consumer.md, line 50
> > <https://reviews.apache.org/r/52570/diff/2/?file=1613256#file1613256line50>
> >
> > Replace "users" with "user application". 
> > 
> > We provide the capability for the user application to get notified when 
> > ...
> > 
> > Rephrase "To do so, simply implement the interface 
> > [EndOfStreamListenerTask]" as "In order to receieve notications on 
> > EndOfStream with the task, the user application should simply implement the 
> > interface ..."

I changed the wording as Jagadish suggested above. Let me know if you have 
further suggestion on top of that.


- Hai


-----------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52570/#review163036
---


On Jan. 24, 2017, 2:07 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52570/
> ---
> 
> (Updated Jan. 24, 2017, 2:07 a.m.)
> 
> 
> Review request for samza.
> 
> 
> Bugs: SAMZA-1025
> https://issues.apache.org/jira/browse/SAMZA-1025
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> documentation for hdfs system consumer
> 
> 
> Diffs
> -
> 
>   docs/learn/documentation/versioned/hdfs/consumer.md PRE-CREATION 
>   docs/learn/documentation/versioned/hdfs/producer.md 
> b0e936f5b0a9c945ea7f02bfc2536ef50f017bf6 
>   docs/learn/documentation/versioned/index.html 
> d0b14ece94341e2cb937cf32db480e69f93303c2 
>   docs/learn/documentation/versioned/jobs/configuration-table.html 
> ba5ebbc54b5c64f82f35ed781dad7023a8f920e1 
> 
> Diff: https://reviews.apache.org/r/52570/diff/
> 
> 
> Testing
> ---
> 
> N/A
> 
> 
> Thanks,
> 
> Hai Lu
> 
>



Re: Review Request 52570: SAMZA-1025: documentation for hdfs system consumer

2017-01-26 Thread Hai Lu


> On Jan. 25, 2017, 10:36 p.m., Jagadish Venkatraman wrote:
> > docs/learn/documentation/versioned/hdfs/consumer.md, line 67
> > <https://reviews.apache.org/r/52570/diff/2/?file=1613256#file1613256line67>
> >
> > The relationship between whitelist and blacklist was not very obvious 
> > to me.
> > 
> > Is the behavior that the whitelist is applied first, and the blacklist 
> > is applied to the matched files later? (to determine which files are to be 
> > ignored).

The order doesn't matter. (X & whitelist) - blacklist == (X - blacklist) & 
whitelist


> On Jan. 25, 2017, 10:36 p.m., Jagadish Venkatraman wrote:
> > docs/learn/documentation/versioned/hdfs/consumer.md, line 97
> > <https://reviews.apache.org/r/52570/diff/2/?file=1613256#file1613256line97>
> >
> > Not clear to me how this differs from the whitelist (*.avro which 
> > specifies what files to process).

They completely different: reader is the type of data encoded in the file 
content; whitelist is used to filter based on file name. Technically you can 
have an avro file that has ".java" as it's extention, right?


> On Jan. 25, 2017, 10:36 p.m., Jagadish Venkatraman wrote:
> > docs/learn/documentation/versioned/jobs/configuration-table.html, line 1819
> > <https://reviews.apache.org/r/52570/diff/2/?file=1613259#file1613259line1819>
> >
> > What's this configuration? Is this the number of messages? What are the 
> > implications of this? 
> > 
> > I'm not in favor of exposing this tunable if this is not 
> > super-significant.

This is important for performance tuning in some cases. Added a bit more 
details to explain


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52570/#review163011
---


On Jan. 24, 2017, 2:07 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52570/
> ---
> 
> (Updated Jan. 24, 2017, 2:07 a.m.)
> 
> 
> Review request for samza.
> 
> 
> Bugs: SAMZA-1025
> https://issues.apache.org/jira/browse/SAMZA-1025
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> documentation for hdfs system consumer
> 
> 
> Diffs
> -
> 
>   docs/learn/documentation/versioned/hdfs/consumer.md PRE-CREATION 
>   docs/learn/documentation/versioned/hdfs/producer.md 
> b0e936f5b0a9c945ea7f02bfc2536ef50f017bf6 
>   docs/learn/documentation/versioned/index.html 
> d0b14ece94341e2cb937cf32db480e69f93303c2 
>   docs/learn/documentation/versioned/jobs/configuration-table.html 
> ba5ebbc54b5c64f82f35ed781dad7023a8f920e1 
> 
> Diff: https://reviews.apache.org/r/52570/diff/
> 
> 
> Testing
> ---
> 
> N/A
> 
> 
> Thanks,
> 
> Hai Lu
> 
>



Re: Review Request 52570: SAMZA-1025: documentation for hdfs system consumer

2017-01-23 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52570/
---

(Updated Jan. 24, 2017, 2:07 a.m.)


Review request for samza.


Bugs: SAMZA-1025
https://issues.apache.org/jira/browse/SAMZA-1025


Repository: samza


Description (updated)
---

documentation for hdfs system consumer


Diffs (updated)
-

  docs/learn/documentation/versioned/hdfs/consumer.md PRE-CREATION 
  docs/learn/documentation/versioned/hdfs/producer.md 
b0e936f5b0a9c945ea7f02bfc2536ef50f017bf6 
  docs/learn/documentation/versioned/index.html 
d0b14ece94341e2cb937cf32db480e69f93303c2 
  docs/learn/documentation/versioned/jobs/configuration-table.html 
ba5ebbc54b5c64f82f35ed781dad7023a8f920e1 

Diff: https://reviews.apache.org/r/52570/diff/


Testing
---

N/A


Thanks,

Hai Lu



Re: How to Use samza-hdfs

2016-12-20 Thread Hai Lu
"close" as in the file is closed - either the producer moves on to the next
file or the job gets shutdown. I meant to say that this is liking writing
into disk, the data don't actually reach the disk until a flush happens. So
while you are still producing into HDFS, the content don't get
committed/flush until you are finished. That's why if you change the size
per file and make them smaller, you will probably see the previous results
earlier.

It's all just my theory, though. But it's worth a shot:)

And yeah, the stream name doesn't really make a difference as far as I know.

Thanks,
Hai

On Tue, Dec 20, 2016 at 4:40 PM, Rui Tang <tangrui...@gmail.com> wrote:

> By the way, what do you mean "close", close what?
>
> And what should the stream parameter been, like the following "default"
> one? It seems noting to do with the result.
>
> private final SystemStream OUTPUT_STREAM = new SystemStream("hdfs",
> * "default"*);
>
> On Wed, Dec 21, 2016 at 8:23 AM Rui Tang <tangrui...@gmail.com> wrote:
>
>> Thank you, I'll try it out!
>>
>> On Wed, Dec 21, 2016 at 1:45 AM Hai Lu <h...@linkedin.com> wrote:
>>
>> Hi Rui,
>>
>> I've tried out the HDFS producer, too. In my experience, you won't be
>> able to see changes written into HDFS in realtime. The content of the files
>> become visible only after they get closed.
>>
>> You can probably play with the "producer.hdfs.write.batch.size.bytes"
>> config to force rolling over to new files so you can see the results of the
>> previous one.
>>
>> Thanks,
>> Hai
>>
>> On Mon, Dec 19, 2016 at 11:29 PM, Rui Tang <tangrui...@gmail.com> wrote:
>>
>> I'm using samza-hdfs to write Kafka streams to HDFS, but I can't make it
>> work.
>>
>> Here is my samza job's properties file:
>>
>> # Job
>> job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
>> job.name=kafka2hdfs
>>
>> # YARN
>> yarn.package.path=file://${basedir}/target/${project.
>> artifactId}-${pom.version}-dist.tar.gz
>>
>> # Task
>> task.class=net.tangrui.kafka2hdfs.Kafka2HDFSStreamTask
>> task.inputs=kafka.events
>>
>> # Serializers
>> serializers.registry.string.class=org.apache.samza.
>> serializers.StringSerdeFactory
>>
>> # Systems
>> systems.kafka.samza.factory=org.apache.samza.system.kafka.
>> KafkaSystemFactory
>> systems.kafka.samza.msg.serde=string
>> systems.kafka.consumer.zookeeper.connect=localhost:2181
>> systems.kafka.producer.bootstrap.servers=localhost:9092
>>
>> systems.hdfs.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory
>> systems.hdfs.producer.hdfs.writer.class=org.apache.samza.
>> system.hdfs.writer.TextSequenceFileHdfsWriter
>> systems.hdfs.producer.hdfs.base.output.dir=/events
>>
>> # Job Coordinator
>> job.coordinator.system=kafka
>> # Normally, this would be 3, but we have only one broker.
>> job.coordinator.replication.factor=1
>>
>>
>> Here is my simple task:
>>
>> public class Kafka2HDFSStreamTask implements StreamTask {
>>
>>   private final SystemStream OUTPUT_STREAM = new SystemStream("hdfs",*
>> "default"*);
>>
>>
>>
>>   @Override
>>   public void process(IncomingMessageEnvelope incomingMessageEnvelope,
>>   MessageCollector messageCollector, TaskCoordinator
>> taskCoordinator) throws Exception {
>> String message = (String) incomingMessageEnvelope.getMessage();
>> OutgoingMessageEnvelope envelope = new
>> OutgoingMessageEnvelope(OUTPUT_STREAM, message);
>> messageCollector.send(envelope);
>>   }
>> }
>>
>> When running this job, a sequence file will be created in HDFS, but only
>> has some header info, no content. I cannot figure out where is wrong. And
>> what should I provide with the "stream" parameter when building the
>> SystemStream instance.
>>
>>


Re: How to Use samza-hdfs

2016-12-20 Thread Hai Lu
Hi Rui,

I've tried out the HDFS producer, too. In my experience, you won't be able
to see changes written into HDFS in realtime. The content of the files
become visible only after they get closed.

You can probably play with the "producer.hdfs.write.batch.size.bytes"
config to force rolling over to new files so you can see the results of the
previous one.

Thanks,
Hai

On Mon, Dec 19, 2016 at 11:29 PM, Rui Tang  wrote:

> I'm using samza-hdfs to write Kafka streams to HDFS, but I can't make it
> work.
>
> Here is my samza job's properties file:
>
> # Job
> job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
> job.name=kafka2hdfs
>
> # YARN
> yarn.package.path=file://${basedir}/target/${project.
> artifactId}-${pom.version}-dist.tar.gz
>
> # Task
> task.class=net.tangrui.kafka2hdfs.Kafka2HDFSStreamTask
> task.inputs=kafka.events
>
> # Serializers
> serializers.registry.string.class=org.apache.samza.
> serializers.StringSerdeFactory
>
> # Systems
> systems.kafka.samza.factory=org.apache.samza.system.kafka.
> KafkaSystemFactory
> systems.kafka.samza.msg.serde=string
> systems.kafka.consumer.zookeeper.connect=localhost:2181
> systems.kafka.producer.bootstrap.servers=localhost:9092
>
> systems.hdfs.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory
> systems.hdfs.producer.hdfs.writer.class=org.apache.samza.
> system.hdfs.writer.TextSequenceFileHdfsWriter
> systems.hdfs.producer.hdfs.base.output.dir=/events
>
> # Job Coordinator
> job.coordinator.system=kafka
> # Normally, this would be 3, but we have only one broker.
> job.coordinator.replication.factor=1
>
>
> Here is my simple task:
>
> public class Kafka2HDFSStreamTask implements StreamTask {
>   private final SystemStream OUTPUT_STREAM = new SystemStream("hdfs",*
> "default"*);
>
>   @Override
>   public void process(IncomingMessageEnvelope incomingMessageEnvelope,
>   MessageCollector messageCollector, TaskCoordinator
> taskCoordinator) throws Exception {
> String message = (String) incomingMessageEnvelope.getMessage();
> OutgoingMessageEnvelope envelope = new
> OutgoingMessageEnvelope(OUTPUT_STREAM, message);
> messageCollector.send(envelope);
>   }
> }
>
> When running this job, a sequence file will be created in HDFS, but only
> has some header info, no content. I cannot figure out where is wrong. And
> what should I provide with the "stream" parameter when building the
> SystemStream instance.
>


Review Request 52570: SAMZA-1025: documentation for hdfs system consumer

2016-10-05 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52570/
---

Review request for samza.


Bugs: SAMZA-1025
https://issues.apache.org/jira/browse/SAMZA-1025


Repository: samza


Description
---

documentation of hdfs consumer


Diffs
-

  docs/learn/documentation/versioned/hdfs/consumer.md PRE-CREATION 
  docs/learn/documentation/versioned/hdfs/producer.md 
b0e936f5b0a9c945ea7f02bfc2536ef50f017bf6 
  docs/learn/documentation/versioned/jobs/configuration-table.html 
f60cd50fb197423ac3c84fd364bbe4fb3767883e 

Diff: https://reviews.apache.org/r/52570/diff/


Testing
---

N/A


Thanks,

Hai Lu



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-10-05 Thread Hai Lu
  |
   +--> || 
|   | |
  +-+---+
+---+-+   +--+--+
||  
  |   

+++ 
  
 |  
  
 
+---+--+
 
 |  
| 
 |  HDFSSystemFactory   
| 
 |  
| 
 
+--+


Diffs (updated)
-

  build.gradle 2bea27b75288d3103178bc3762b9556f6e69cdd1 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptorUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 
61b7570afae3219b618c8830905035063941bdd7 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSystemConsumer.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptorUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestMultiFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/test/resources/integTest/emptyTestFile PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
  samza-hdfs/src/test/resources/reader/TestEvent.avsc PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/YarnContainerRunner.java 
dacc52de0a34498a715a299bc69c95aebd1b08ba 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing
---

unit tests pass.

manually tested by writing a real hdfs samza job and deploying to a yarn 
cluster.


Thanks,

Hai Lu



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-10-04 Thread Hai Lu


> On Sept. 29, 2016, 10:02 p.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala, 
> > line 197
> > <https://reviews.apache.org/r/51142/diff/5-7/?file=1493810#file1493810line197>
> >
> > Thinking of this more, I would prefer less dependency imposed between 
> > samza-yarn and samza-hdfs modules. Thinking of a case where HDFS consumer 
> > is used by a standalone Samza job, there is no YarnConfig object in the 
> > job. I think we should make this as required config for HdfsSystemConsumer, 
> > just like ZooKeeper connnect string is required for KafkaSystemConsumer.
> > 
> > Also, under which condition we need to clear the partition descriptor 
> > info in the staging dir? We need to think about the cleanup procedure as 
> > well.
> 
> Hai Lu wrote:
> We need to remove partition descriptors when job is done. Not doing so 
> would end up spamming user's HDFS space, causing immediate troubles to our 
> users. 
> 
> But right now there is no way that HdfsSystemConsumer/Admin would know 
> when the job is shutdown. So I don't see there is a solution if we don't 
> directly/indirectly depend on YARN, since only the YARN codes have this idea 
> of staging directory, and actually clean up the directory at the end of the 
> job.  I think what we really need to do, long term, is to support staging 
> direcotry in the Samza level, so that in addition to YARN, other platforms 
> like Docker, Mesos, Standalone can work as well.
> 
> Plus we have to keep in mind that only YARN has the kerberos support for 
> now. So currently HDFS systems ARE depending on YARN in that sense. Security 
> is one more thing to deal with (aside from staging directory) before we can 
> say HDFS sytems no long depends on YARN.
> 
> What do you think? I will keep this issue open.
> 
> Yi Pan (Data Infrastructure) wrote:
> There are two different levels of dependencies here: a) code-level 
> dependency that means the HdfsSystemConsumer code depends directly on 
> samza-yarn classes; b) config/semantic dependency that means some expected 
> behavior of a certain function (i.e. cleanup) depends on other modules. I 
> would prefer to remove the code-level dependency from the beginning. We can 
> still set the configuration of HdfsSystemConsumer to use the same staging 
> directory configuration from samza-yarn to achieve the cleanup function. This 
> means that HdfsSystemConsumer itself does not support after-completion 
> cleanup yet and depends on samza-yarn to clean up. It is a configure-level 
> dependency and we have the freedom to remove this w/o code change when either 
> a) HdfsSystemConsumer can cleanup the staging directory after end-of-stream; 
> b) staging directory config is moved to samza-core. Thoughts?

Discussed offline. Given that HDFS system consumer right now has to run on 
YARN, it's OK to explictly have the code dependency for now. Opened SAMZA-1032 
to track the work to move staging directory out of the YARN context.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150949
---


On Oct. 5, 2016, 12:02 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Oct. 5, 2016, 12:02 a.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then p

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-10-04 Thread Hai Lu
  |
   +--> || 
|   | |
  +-+---+
+---+-+   +--+--+
||  
  |   

+++ 
  
 |  
  
 
+---+--+
 
 |  
| 
 |  HDFSSystemFactory   
| 
 |  
| 
 
+--+


Diffs (updated)
-

  build.gradle 2bea27b75288d3103178bc3762b9556f6e69cdd1 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptorUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 
61b7570afae3219b618c8830905035063941bdd7 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSystemConsumer.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptorUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestMultiFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/test/resources/integTest/emptyTestFile PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
  samza-hdfs/src/test/resources/reader/TestEvent.avsc PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/YarnContainerRunner.java 
dacc52de0a34498a715a299bc69c95aebd1b08ba 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing
---

unit tests pass.

manually tested by writing a real hdfs samza job and deploying to a yarn 
cluster.


Thanks,

Hai Lu



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-10-03 Thread Hai Lu
  |
   +--> || 
|   | |
  +-+---+
+---+-+   +--+--+
||  
  |   

+++ 
  
 |  
  
 
+---+--+
 
 |  
| 
 |  HDFSSystemFactory   
| 
 |  
| 
 
+--+


Diffs (updated)
-

  build.gradle 2bea27b75288d3103178bc3762b9556f6e69cdd1 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptorUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 
61b7570afae3219b618c8830905035063941bdd7 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSystemConsumer.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptorUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestMultiFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/test/resources/integTest/emptyTestFile PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
  samza-hdfs/src/test/resources/reader/TestEvent.avsc PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/YarnContainerRunner.java 
dacc52de0a34498a715a299bc69c95aebd1b08ba 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing
---

unit tests pass.

manually tested by writing a real hdfs samza job and deploying to a yarn 
cluster.


Thanks,

Hai Lu



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-10-03 Thread Hai Lu
  |
   +--> || 
|   | |
  +-+---+
+---+-+   +--+--+
||  
  |   

+++ 
  
 |  
  
 
+---+--+
 
 |  
| 
 |  HDFSSystemFactory   
| 
 |  
| 
 
+--+


Diffs (updated)
-

  build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptorUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 
61b7570afae3219b618c8830905035063941bdd7 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSystemConsumer.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptorUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestMultiFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/test/resources/integTest/emptyTestFile PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
  samza-hdfs/src/test/resources/reader/TestEvent.avsc PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/YarnContainerRunner.java 
dacc52de0a34498a715a299bc69c95aebd1b08ba 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing
---

unit tests pass.

manually tested by writing a real hdfs samza job and deploying to a yarn 
cluster.


Thanks,

Hai Lu



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-10-02 Thread Hai Lu


> On Sept. 29, 2016, 10:06 p.m., Yi Pan (Data Infrastructure) wrote:
> > samza-shell/src/main/bash/run-job-for-azkaban.sh, line 1
> > <https://reviews.apache.org/r/51142/diff/7/?file=1513511#file1513511line1>
> >
> > Question: why do we need this in open source? Don't we already have a 
> > run-job.sh in open source that is general for any YARN application?

Removed from open source.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150953
-------


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
>  

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-29 Thread Hai Lu


> On Sept. 29, 2016, 5:56 p.m., Prateek Maheshwari wrote:
> > samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala, 
> > line 66
> > <https://reviews.apache.org/r/51142/diff/5/?file=1493810#file1493810line66>
> >
> > "systems.%s.consumer.buffer-capacity" makes sense to me. Regarding the 
> > "hdfs" prefix, there's already inconsistency in current configs. The kafka 
> > system configs don't include kafka in the config name, but the hdfs 
> > producer configs do. The kafka convention is better IMHO.
> > 
> > Either way, we should at least be consistent between this and the new 
> > partitioner/reader configs which don't have the hdfs prefix.
> 
> Prateek Maheshwari wrote:
> Btw, in the new configs we'll be using camelCase instead of dashes, so 
> we'll eventually need to change it it to bufferCapacity.

There won't be any problems if I change it to camelCase style now, right? Or 
should I keep the dash?


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150883
---


On Sept. 28, 2016, 9:57 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 28, 2016, 9:57 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
>

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-28 Thread Hai Lu
  |
   +--> || 
|   | |
  +-+---+
+---+-+   +--+--+
||  
  |   

+++ 
  
 |  
  
 
+---+--+
 
 |  
| 
 |  HDFSSystemFactory   
| 
 |  
| 
 
+--+


Diffs (updated)
-

  build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptorUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 
61b7570afae3219b618c8830905035063941bdd7 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSystemConsumer.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptorUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestMultiFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/test/resources/integTest/emptyTestFile PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
  samza-hdfs/src/test/resources/reader/TestEvent.avsc PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-shell/src/main/bash/run-job-for-azkaban.sh PRE-CREATION 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/YarnContainerRunner.java 
dacc52de0a34498a715a299bc69c95aebd1b08ba 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing
---

unit tests pass.

manually tested by writing a real hdfs samza job and deploying to a yarn 
cluster.


Thanks,

Hai Lu



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-28 Thread Hai Lu


> On Sept. 28, 2016, 12:28 a.m., Navina Ramesh wrote:
> > build.gradle, line 308
> > <https://reviews.apache.org/r/51142/diff/6/?file=1506732#file1506732line308>
> >
> > why is this dependency needed here? It seems like this compile 
> > dependency is required for samza-hdfs and now samza-hdfs depends on 
> > samza-yarn. Is there a better way to do this?

It comes from the fact that HDFSConfig has a dependency on YarnConfig. It's not 
a hard dependency as we can copy-paste the config key name. So far we don't 
have a cleaner solution.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150648
-------


On Sept. 20, 2016, 11:22 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 20, 2016, 11:22 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+--+
&

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-28 Thread Hai Lu


> On Sept. 27, 2016, 11:11 p.m., Navina Ramesh wrote:
> > samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala, 
> > line 70
> > <https://reviews.apache.org/r/51142/diff/6/?file=1506743#file1506743line70>
> >
> > what is the "default-partitioner"? Is it possible to have more than one 
> > partitioner?

Per my dicussion with Yi, we are simply allowing the possibility of more than 
one partitioner in the future by reflecting in the config name. We are not 
going to change samza-api to make it actually happen for now.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150637
-----------


On Sept. 20, 2016, 11:22 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 20, 2016, 11:22 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +--+---

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-27 Thread Hai Lu


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java,
> >  line 24
> > <https://reviews.apache.org/r/51142/diff/5/?file=1493806#file1493806line24>
> >
> > One concern I had w/ this HdfsAvroFileReader/Writer is the version 
> > conflict issue. LinkedIn's Kafka version still uses avro-1.4 in the serde, 
> > while hdfs already uses avro-1.7 in 2.6.1. I guess that we need to find a 
> > solution inside LinkedIn to resolve it. Let's sync up face-to-face tomorrow.
> 
> Hai Lu wrote:
> I was well aware of the avro issue. I tried so many different APIs that I 
> finally found the set of APIs that work for both 1.4 and 1.7
> 
> Yi Pan (Data Infrastructure) wrote:
> Great! I am really curious what are the set of compatible APIs! So, I 
> guess that we just enforce avro-1.4 when compiling samza-hdfs module? I 
> remember that I tried last time and got a build failure in samza-hdfs w/ 
> AvroDataFileHdfsWriter in samza-li build. I am curious how you made it work.
> 
> Navina Ramesh wrote:
> Right now, we exclude samza-hdfs build in samza-li. 
>   "build": "ligradle -PscalaVersion=2.10 -Prelease=true 
> -PallArtifacts build -x:samza-hdfs_2.10:build",
>   
> We may want to fully understand the avro changes introduced by 
> HdfsProducer and/or HdfsConsumer in samza-li. This sounds like a blocker for 
> me right now. How are we going to overcome avro conflict introduced in 
> HdfsSystemProducer?

I know. I included it back in samza-li and it worked just fine. Just need some 
extra dependency to make the tests pass. I have been only using li_trunk to 
deploy to Hadoop's YARN at LinkedIn


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148629
---


On Sept. 20, 2016, 11:22 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 20, 2016, 11:22 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   |

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-27 Thread Hai Lu


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java,
> >  line 101
> > <https://reviews.apache.org/r/51142/diff/5/?file=1493801#file1493801line101>
> >
> > Better, to avoid the wasteful remote IO if there are multiple calls to 
> > getPartitionDescriptor from multiple threads, is to use bucketized locks on 
> > the ConcurrentHashMap entries to ensure synchronization in populating a 
> > certain hash map entry. Guava cache implemented the bucketized locking as a 
> > built-in method already: 
> > http://www.tutorialspoint.com/guava/guava_caching_utilities.htm
> 
> Navina Ramesh wrote:
> What was the resolution here? Was there any change to the IO pattern to 
> use caching?

I believe this is just to optimize the situation that multiple calls happen at 
the same time and causing everyone making remote calls. After the change here, 
only the first one will actually make the remote call while everyone else be 
blocked.

It's a very very tiny improvement, to be honest.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148612
-------


On Sept. 20, 2016, 11:22 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 20, 2016, 11:22 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>| 

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-20 Thread Hai Lu
  |
   +--> || 
|   | |
  +-+---+
+---+-+   +--+--+
||  
  |   

+++ 
  
 |  
  
 
+---+--+
 
 |  
| 
 |  HDFSSystemFactory   
| 
 |  
| 
 
+--+


Diffs (updated)
-

  build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptorUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 
61b7570afae3219b618c8830905035063941bdd7 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSystemConsumer.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptorUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestMultiFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/test/resources/integTest/emptyTestFile PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
  samza-hdfs/src/test/resources/reader/TestEvent.avsc PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-shell/src/main/bash/bash-run-job.sh PRE-CREATION 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/YarnContainerRunner.java 
dacc52de0a34498a715a299bc69c95aebd1b08ba 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing
---

unit tests pass.

manually tested by writing a real hdfs samza job and deploying to a yarn 
cluster.


Thanks,

Hai Lu



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-20 Thread Hai Lu


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java,
> >  line 83
> > <https://reviews.apache.org/r/51142/diff/5/?file=1493803#file1493803line83>
> >
> > Question: this seems to be highly related to how the HDFS files are 
> > organized. It is hard to see how a common practice would look like, 
> > especially in open source. Can we make the groupIdentifier pluggable?
> 
> Hai Lu wrote:
> Why is it HDFS specific? At the very least, it can apply to any file 
> system like systems. The idea of grouping (or advanced partitioning) is to 
> allow multiple highly related files or, say AWS S3Objects, to be processed by 
> the same task.
> 
> Anyway, this is sort of pluggable already. If you don't specify 
> "group.pattern" then the group identifier will be the entire file name (i.e. 
> each group will simply be each single file themselves).
> 
> Yi Pan (Data Infrastructure) wrote:
> If the intended implementation is for a generic Partitioner for some 
> non-partitioned data source, we would need to add to the samza-api as a 
> general Partitioner interface, and then add the DirectoryPartitioner as an 
> implementation of the Partitioner interface in yarn package. Ideally, we 
> would need to make the Partitioner class also confgiurable, s.t. user can 
> implement their own customized Partitioner. As a first step, I would agree 
> that we don't expose to the user as a public API and use DirectoryPartitioner 
> as a default implementation. But it would be nice to have the configuration 
> following the scope:
> systems.%s.partitioner..class, 
> systems.%s.partitioner..group-pattern. Let's discuss in 
> person tomorrow.

Per our discussion, will do the config part and skip the samza-api change for 
now.


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java,
> >  line 152
> > <https://reviews.apache.org/r/51142/diff/5/?file=1493806#file1493806line152>
> >
> > Ideally, shouldn't this class also include a avroFilePath variable to 
> > ensure that we never compare checkpoints for two different files?

I don't want this class to know anything about multiple files at all. Just like 
in kafkaSystemAdmin we simply compare two LONGs, we don't do 
system/stream/partition check. So similarly I will do the file path check 
upstream, which is going to be the hdfsSystemAdmin.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148629
---


On Sept. 9, 2016, 1:34 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 9, 2016, 1:34 a.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-20 Thread Hai Lu


> On Sept. 14, 2016, 6:19 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java,
> >  line 183
> > <https://reviews.apache.org/r/51142/diff/5/?file=1493815#file1493815line183>
> >
> > It would be better to ensure that the index used in each partition is 
> > referring to the sorted list from inputFiles, to guarantee the consistent 
> > ordering/indexing of the same file in the list of input files. Is it 
> > possible that a new file is inserted in the middle of the sorted list? 
> > Maybe sort by create time to ensure the files are always appended to the 
> > greatest index in the sorted list?

This seems to be addressed if we perform validation on the partition 
descriptors that we persisted.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148780
-----------


On Sept. 9, 2016, 1:34 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 9, 2016, 1:34 a.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Description |   
> +--v--+ +

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-14 Thread Hai Lu


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java,
> >  line 142
> > <https://reviews.apache.org/r/51142/diff/5/?file=1493801#file1493801line142>
> >
> > Isn't it clearer to have one loop like below instead of two embedded 
> > loops:
> > while (!isShutdown) {
> >   if (!reader.hasNext()) {
> > break;
> >   }
> >   IncomingMessageEnvelope messageEnvelope = reader.readNext();
> >   try {
> >  super.put()
> >  ...
> >   } catch () {
> >  ...
> >   }
> > }

No. In your case, if the super.put() fails, your code will skip the current 
event and read the next one. Unless you throw a runtime exception in the catch 
block to completely stop the consumption.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148612
---


On Sept. 9, 2016, 1:34 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 9, 2016, 1:34 a.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  
>  |
>   | 
>+-+ HDFS   
>   | 
>|   Obtain|
>   | 
>|  Partition  
> +--+--^--+-^---+
>  
>| Description|  |  |   
>   | 
>||  |  |   
>   | 
>|  +-v---+  |  |   
> Filtering/| 
>|  | |  |  +---+
> Grouping +-+   
>|  | HDFSAvroFileReader  |  |  |   
> |   
>|  | |Persist   |  |   
> |   
>|  +-+---+   Partition  |  |   
> |   
>||  Descrip

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-14 Thread Hai Lu


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java,
> >  line 58
> > <https://reviews.apache.org/r/51142/diff/5/?file=1493803#file1493803line58>
> >
> > nit: since the input whiteList/blakcList are also regex, shouldn't we 
> > just name them the same?

Separating white list and black list simplifies the regex a lot. I see this 
convention in databus, Kafka (http://kafka.apache.org/documentation.html) and 
many other systems.


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java,
> >  line 77
> > <https://reviews.apache.org/r/51142/diff/5/?file=1493803#file1493803line77>
> >
> > Make sure that you check-in to open source after we disable JDK7 build. 
> > This won't work for JDK7 build in open source.

Will do.


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java,
> >  line 83
> > <https://reviews.apache.org/r/51142/diff/5/?file=1493803#file1493803line83>
> >
> > Question: this seems to be highly related to how the HDFS files are 
> > organized. It is hard to see how a common practice would look like, 
> > especially in open source. Can we make the groupIdentifier pluggable?

Why is it HDFS specific? At the very least, it can apply to any file system 
like systems. The idea of grouping (or advanced partitioning) is to allow 
multiple highly related files or, say AWS S3Objects, to be processed by the 
same task.

Anyway, this is sort of pluggable already. If you don't specify "group.pattern" 
then the group identifier will be the entire file name (i.e. each group will 
simply be each single file themselves).


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java,
> >  line 162
> > <https://reviews.apache.org/r/51142/diff/5/?file=1493803#file1493803line162>
> >
> > We had an on-going issue that the partitionId to the ssp mapping does 
> > not seem to be consistent between the job restarts. I suspect that might be 
> > a problem here as well, if the groupedPartitions list is not sorted in a 
> > consistent order?

Wait, that would be a huge issue of Samza... I don't understand how is the 
mapping between partition id and ssp not consistent? The ssp contains the 
partition id itself, right?


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java,
> >  line 174
> > <https://reviews.apache.org/r/51142/diff/5/?file=1493803#file1493803line174>
> >
> > Now I see what this PartitionDescriptor really mean... Is it much more 
> > straightforward if renamed to partitionToFilePathsMap?

I started to realize that the name of partition descriptor isn't informative 
enough. I have updated the design doc to list it in the glossary section. My 
problem with "partitionToFilePath(s)" is just that, well, it's not a noun. 
"Descriptor" is more concise.


> On Sept. 13, 2016, 1:37 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java,
> >  line 24
> > <https://reviews.apache.org/r/51142/diff/5/?file=1493806#file1493806line24>
> >
> > One concern I had w/ this HdfsAvroFileReader/Writer is the version 
> > conflict issue. LinkedIn's Kafka version still uses avro-1.4 in the serde, 
> > while hdfs already uses avro-1.7 in 2.6.1. I guess that we need to find a 
> > solution inside LinkedIn to resolve it. Let's sync up face-to-face tomorrow.

I was well aware of the avro issue. I tried so many different APIs that I 
finally found the set of APIs that work for both 1.4 and 1.7


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148629
---


On Sept. 9, 2016, 1:34 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 9, 2016, 1:34 a.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-13 Thread Hai Lu


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > Still in the middle but don't want to lose what I had up to now.

Also in the middle of addressing all the feedbacks. Have all the changes 
locally. Will push them altogether later. Thanks again for your review!


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java, 
> > line 47
> > <https://reviews.apache.org/r/51142/diff/5/?file=1493800#file1493800line47>
> >
> > nit: when you refer to the class names, it would be better to use 
> > {@link HdfsSystemAdmin} {@link HdfsSystemConsumer} etc.

Will do.


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java, 
> > line 91
> > <https://reviews.apache.org/r/51142/diff/5/?file=1493800#file1493800line91>
> >
> > You can do:
> > try(FSDataOutputStream fos = fs.create(targetPath)) {
> >   fos.write(PartitionDescriptionUtil.toJson();
> >   }

Do you just intend to narrow down the try block? the "getFileSystem" method 
above will also throw IOExecption, so I have to include everything here.


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java, 
> > line 105
> > <https://reviews.apache.org/r/51142/diff/5/?file=1493800#file1493800line105>
> >
> > What if the PartitionDescriptor already exists? Could it be the case 
> > that the systemStreamMetadata maintains a different copy of 
> > PartitionDescriptor? It is not clear to me which one is the source of 
> > truth? directoryPartitioner.getPartitionMetadataMap()? Or 
> > directoryPartitioner.getPartitionDescriptor()? Maybe I miss some basic 
> > information regarding to the concept on PartitionDescriptor vs 
> > PartitionMetadataMap?

You are right. This is an unsolved problem given that we assume the HDFS folder 
is immutable. So now, what if the HDFS folder really is altered. Before we 
support the mutable HFDS input,  I think we have two options: 1. Throw an 
runtime exception to stop the job if we see inconsistency 2. Ignore the newer 
folder info we got and always treat the partition descriptors that we persisted 
on HDFS as the source of truth. Use the source of truth to proceed with the job 
and don't kill it or throw exception.
Do you have a preference?


> On Sept. 13, 2016, 12:33 a.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java,
> >  line 69
> > <https://reviews.apache.org/r/51142/diff/5/?file=1493801#file1493801line69>
> >
> > I would prefer to follow the same pattern as KafkaSystemConsumer, i.e. 
> > passing in the HdfsSystemConsumerMetrics object instead of the 
> > MetricsRegistry. Please check the code in KafkaSystemFactory.getConsumer() 
> > to see how the metrics object are created and passed along.

I can't avoid directly using MetricsRegistry. At the very least, I have to pass 
this to the base class: BlockingEnvelopeMap (unless you want to change all the 
interface in BlockingEnvelopMap as well. Event if we wan t to do that. It seems 
beyond the scope of this RB. I can maybe to a separte fix for that.). But I 
will create a HdfsSystemConsumerMetrics.


- Hai


-----------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review148612
---


On Sept. 9, 2016, 1:34 a.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 9, 2016, 1:34 a.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like source

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-08 Thread Hai Lu
  |
   +--> || 
|   | |
  +-+---+
+---+-+   +--+--+
||  
  |   

+++ 
  
 |  
  
 
+---+--+
 
 |  
| 
 |  HDFSSystemFactory   
| 
 |  
| 
 
+--+


Diffs
-

  build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
  gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 
61b7570afae3219b618c8830905035063941bdd7 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSystemConsumer.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestMultiFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/test/resources/integTest/emptyTestFile PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
  samza-hdfs/src/test/resources/reader/TestEvent.avsc PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing
---

unit tests pass.

manually tested by writing a real hdfs samza job and deploying to a yarn 
cluster.


Thanks,

Hai Lu



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-08 Thread Hai Lu
dfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 
61b7570afae3219b618c8830905035063941bdd7 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSystemConsumer.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestMultiFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/test/resources/integTest/emptyTestFile PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
  samza-hdfs/src/test/resources/reader/TestEvent.avsc PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing
---

unit tests pass.

manually tested by writing a real hdfs samza job and deploying to a yarn 
cluster.


Thanks,

Hai Lu



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-08 Thread Hai Lu
/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 
61b7570afae3219b618c8830905035063941bdd7 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSystemConsumer.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestMultiFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/test/resources/integTest/emptyTestFile PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
  samza-hdfs/src/test/resources/reader/TestEvent.avsc PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing
---

unit tests pass.

manually tested by writing a real hdfs samza job and deploying to a yarn 
cluster.


Thanks,

Hai Lu



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-07 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Sept. 7, 2016, 11:49 p.m.)


Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
Navina Ramesh.


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description (updated)
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics

Design doc can be found here: 
https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf


Diffs
-

  build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
  gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 
61b7570afae3219b618c8830905035063941bdd7 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSystemConsumer.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestMultiFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/test/resources/integTest/emptyTestFile PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
  samza-hdfs/src/test/resources/reader/TestEvent.avsc PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing (updated)
---

unit tests pass.

manually tested by writing a real hdfs samza job and deploying to a yarn 
cluster.


Thanks,

Hai Lu



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-07 Thread Hai Lu


> On Sept. 1, 2016, 9:21 p.m., Navina Ramesh wrote:
> > @lhaiesp: Your patch looks awesome. Happy to review again once you have 
> > addressed the comments. It will be great if you can add some unit test for 
> > HdfsSystemConsumer. Some of the documentation that you will have to include 
> > for this feature will be:
> > * Add newly introduced configs to configuration-table.html
> > * Add newly introduced metrics to metrics-table.html (Pending SAMZA-702 
> > commit)
> > * Add a webpage for describing the behavior of HDFS systemconsumer (or more 
> > generically, consuming from Bounded input sources) and how to use the HDFS 
> > consumer
> > 
> > You can choose to keep the documentation as a part of this RB or create a 
> > follow-up JIRA for documentation and assign it to your self. Ideally, we 
> > don't want to have a lot of gap between code and documentation. 
> > 
> > Thanks for such a thorough work!

I do have a work item for documentation. I will take a look at the existing 
documentations.


> On Sept. 1, 2016, 9:21 p.m., Navina Ramesh wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java,
> >  line 149
> > <https://reviews.apache.org/r/51142/diff/4/?file=1487650#file1487650line149>
> >
> > Isn't numTotalEventsCounter a sum of all counters in numEventsCounter 
> > in the map ? Do we want to maintain a running sum?

It is. I think I was thinking about adding a config to enable/disable per 
partition metrics eventually. In that case a total metrics would be necessary.


> On Sept. 1, 2016, 9:21 p.m., Navina Ramesh wrote:
> > samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java,
> >  line 171
> > <https://reviews.apache.org/r/51142/diff/4/?file=1487652#file1487652line171>
> >
> > Question: Is the generateOldestOffset simply returning a string of "0" 
> > delimited by a comma? The number of "0" matches the number of files in the 
> > group?

Yes.


- Hai


-----------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review147596
---


On Sept. 7, 2016, 11:44 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Sept. 7, 2016, 11:44 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> 
> Diffs
> -
> 
>   build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
>   gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
>   samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
> PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
> PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
>  PRE-CREATION 
>   samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 
> 61b7570afae3219b618c8830905035063941bdd7 
>   
> samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
> 92eb4472533db67dca01f075cb460581b4bdac0d 
>   
> samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala
>  ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
>   
> samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSyst

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-07 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Sept. 7, 2016, 11:44 p.m.)


Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
Navina Ramesh.


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics


Diffs (updated)
-

  build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
  gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala 
61b7570afae3219b618c8830905035063941bdd7 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestHdfsSystemConsumer.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestMultiFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/test/resources/integTest/emptyTestFile PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
  samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
  samza-hdfs/src/test/resources/reader/TestEvent.avsc PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing
---

unit tests pass.

tested by writing a real hdfs samza job and deploying to hadoop cluster.


Thanks,

Hai Lu



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-09-06 Thread Hai Lu


> On Aug. 31, 2016, 7:25 a.m., Navina Ramesh wrote:
> > gradle/dependency-versions.gradle, line 39
> > <https://reviews.apache.org/r/51142/diff/4/?file=1487648#file1487648line39>
> >
> > Why is this dependency introduced? Is it possible to get rid of this 
> > dependency ?

Yes. We can remove it here. I only added it because I found that the absence of 
this can cause unit test failure in the li_trunk branch. I guess we can do the 
fix there, instead.


- Hai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review147414
-------


On Aug. 29, 2016, 5:27 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Aug. 29, 2016, 5:27 p.m.)
> 
> 
> Review request for samza, Chris Pettitt, Yi Pan (Data Infrastructure), and 
> Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> 
> Diffs
> -
> 
>   build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
>   gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
>   samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
> PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
> PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
>  PRE-CREATION 
>   
> samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
> 92eb4472533db67dca01f075cb460581b4bdac0d 
>   
> samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala
>  ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
>   
> samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptionUtil.java
>  PRE-CREATION 
>   
> samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
>  PRE-CREATION 
>   
> samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestHdfsFileSystemAdapter.java
>  PRE-CREATION 
>   
> samza-hdfs/src/test/java/org/apache/samza/system/hdfs/reader/TestAvroFileHdfsReader.java
>  PRE-CREATION 
>   samza-hdfs/src/test/resources/partitioner/testfile01 PRE-CREATION 
>   samza-hdfs/src/test/resources/partitioner/testfile02 PRE-CREATION 
>   
> samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
>  261310d03de204718621f601117f016da14841df 
>   samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
> 4e328a5f8c2b496a71e36c106339b7af263c96c7 
> 
> Diff: https://reviews.apache.org/r/51142/diff/
> 
> 
> Testing
> ---
> 
> unit tests pass.
> 
> tested by writing a real hdfs samza job and deploying to hadoop cluster.
> 
> 
> Thanks,
> 
> Hai Lu
> 
>



Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-08-23 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Aug. 23, 2016, 11:20 p.m.)


Review request for samza.


Changes
---

System consumer


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description (updated)
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics


Diffs (updated)
-

  build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
  gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/AvroFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/HdfsReaderFactory.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/MultiFileHdfsReader.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/reader/SingleFileHdfsReader.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptionJsonUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing
---

unit tests pass.

tested by writing a real hdfs samza job and deploying to hadoop cluster.


Thanks,

Hai Lu



[DISCUSS] A HDFS system consumer for Samza

2016-08-18 Thread Hai Lu
Hi,

I have been recently working on a HDFS system consumer for Samza. The work
includes two major parts: 1. properly partitioning a HDFS directory and 2.
consuming from HDFS files.

I have attached the design doc in the Jira ticket here:
https://issues.apache.org/jira/browse/SAMZA-967

It would be great to here some feedback from your.

Thanks,
Hai


Review Request 51142: SAMZA-967: HDFS System Consumer

2016-08-16 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

Review request for samza.


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description
---

Add HDFS System Consumer:

Check-in is divided into two parts:

1. System admin, partiioner
2. System consumer with metrics


Diffs
-

  build.gradle 1d4eb74b1294318db8454631ddd0901596121ab2 
  gradle/dependency-versions.gradle 47c71bfde027835682889407261d4798b629d214 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemAdmin.java 
PRE-CREATION 
  samza-hdfs/src/main/java/org/apache/samza/system/hdfs/HdfsSystemConsumer.java 
PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/PartitionDescriptionJsonUtil.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/DirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/FileSystemAdapter.java
 PRE-CREATION 
  
samza-hdfs/src/main/java/org/apache/samza/system/hdfs/partitioner/HdfsFileSystemAdapter.java
 PRE-CREATION 
  samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemAdmin.scala 
92eb4472533db67dca01f075cb460581b4bdac0d 
  
samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsSystemFactory.scala 
ef3c20a097ddf2feecaf8b0ad4587ea4bf6570b7 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/TestPartitionDesctiptionJsonUtil.java
 PRE-CREATION 
  
samza-hdfs/src/test/java/org/apache/samza/system/hdfs/partitioner/TestDirectoryPartitioner.java
 PRE-CREATION 
  
samza-hdfs/src/test/scala/org/apache/samza/system/hdfs/TestHdfsSystemProducerTestSuite.scala
 261310d03de204718621f601117f016da14841df 
  samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnJobFactory.scala 
4e328a5f8c2b496a71e36c106339b7af263c96c7 

Diff: https://reviews.apache.org/r/51142/diff/


Testing
---

unit tests pass.

tested by writing a real hdfs samza job and deploying to hadoop cluster.


Thanks,

Hai Lu