Re: Review Request 52476: Do not load task store which are older than delete tombstones.

2016-10-03 Thread Jake Maes

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52476/#review151273
---



Looks good. Just a couple things below.


samza-core/src/main/scala/org/apache/samza/config/StorageConfig.scala (line 32)


Typo: deletion.retention.ms is not a valid property. 

http://kafka.apache.org/documentation.html#configuration



samza-core/src/test/scala/org/apache/samza/storage/TestTaskStorageManager.scala 
(line 293)


2 nits:
1. Can you swap this test with the next one in terms of position? The tests 
above and below this one are related, so this one breaks them up, which just 
adds cognitive load for the reader.
2. I'm all for descriptive names, but this is almost un-tweet-able. :-) 
Could it be shortened to: 
testStoreDeletedWhenOffsetFileOlderThanDeleteRetention()


- Jake Maes


On Oct. 3, 2016, 5 p.m., Shanthoosh Venkataraman wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52476/
> ---
> 
> (Updated Oct. 3, 2016, 5 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Every local task store is backed up by a kafka changelog topic. Due to log 
> compaction, delete tombstones of the changelog topic have a ttl of 
> delete.retention.ms. Replaying the events from the changelog that has missing 
> delete tombstones, would result in creation of an inconsistent local 
> store(due to the missing of some delete events). This patch deletes the local 
> stores in which difference between current time and last modified time of the 
> offset file is greater than delete.retention.ms during the container startup.
> 
> 
> Diffs
> -
> 
>   samza-core/src/main/java/org/apache/samza/storage/StorageRecovery.java 
> 9329edf7d724f3a0d9235354bb77936f713e3b5f 
>   samza-core/src/main/scala/org/apache/samza/config/StorageConfig.scala 
> be3f1068f7921c9c8f8697577efea6580672813e 
>   samza-core/src/main/scala/org/apache/samza/container/SamzaContainer.scala 
> 05a996c98075ea8ed3767af666b9beeb1933f2a6 
>   samza-core/src/main/scala/org/apache/samza/storage/TaskStorageManager.scala 
> 0b7bcdda1639eea8239a69c31bdf42558e9077d2 
>   
> samza-core/src/test/scala/org/apache/samza/storage/TestTaskStorageManager.scala
>  4d40f520e54beb643acd8410c772b75e2f6a9162 
> 
> Diff: https://reviews.apache.org/r/52476/diff/
> 
> 
> Testing
> ---
> 
> Unit testing and manual testing has been done to verify the functionality.
> 
> 
> Thanks,
> 
> Shanthoosh Venkataraman
> 
>



Re: Review Request 47835: SAMZA-914: Initial draft for Java programming APIs on operators supporting DAGs

2016-10-03 Thread Yi Pan (Data Infrastructure)

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/47835/
---

(Updated Oct. 4, 2016, 12:45 a.m.)


Review request for samza, Boris Shkolnik, Chris Pettitt, Chinmay Soman, Jake 
Maes, Navina Ramesh, Jagadish Venkatraman, and Xinyu Liu.


Changes
---

Updated based on Jagadish and Jake's comments. Thanks, guys!


Bugs: SAMZA-914
https://issues.apache.org/jira/browse/SAMZA-914


Repository: samza


Description
---

SAMZA-914: initial draft of operator programming API. Design doc attached to 
SAMZA-914: 
https://issues.apache.org/jira/secure/attachment/12821524/SAMZA-914_%20operator%20Java%20programming%20API%20-%20Google%20Docs.pdf


Diffs (updated)
-

  build.gradle 16facbbf4dff378c561461786ff186bd9eed 
  gradle/dependency-versions.gradle 52e25aa53a1edc85d478b48898621b26508ad4bb 
  samza-api/src/test/java/org/apache/samza/config/TestConfig.java 
5d066c5867e9df9e94e60bde825dedf10703b399 
  
samza-operator/src/main/java/org/apache/samza/operators/api/MessageStream.java 
PRE-CREATION 
  
samza-operator/src/main/java/org/apache/samza/operators/api/MessageStreams.java 
PRE-CREATION 
  
samza-operator/src/main/java/org/apache/samza/operators/api/TriggerBuilder.java 
PRE-CREATION 
  samza-operator/src/main/java/org/apache/samza/operators/api/WindowState.java 
PRE-CREATION 
  samza-operator/src/main/java/org/apache/samza/operators/api/Windows.java 
PRE-CREATION 
  samza-operator/src/main/java/org/apache/samza/operators/api/data/Message.java 
PRE-CREATION 
  
samza-operator/src/main/java/org/apache/samza/operators/api/internal/Operators.java
 PRE-CREATION 
  
samza-operator/src/main/java/org/apache/samza/operators/api/internal/Trigger.java
 PRE-CREATION 
  
samza-operator/src/main/java/org/apache/samza/operators/api/internal/WindowFn.java
 PRE-CREATION 
  
samza-operator/src/main/java/org/apache/samza/operators/api/internal/WindowOutput.java
 PRE-CREATION 
  
samza-operator/src/main/java/org/apache/samza/operators/impl/ChainedOperators.java
 PRE-CREATION 
  
samza-operator/src/main/java/org/apache/samza/operators/impl/OperatorImpl.java 
PRE-CREATION 
  
samza-operator/src/main/java/org/apache/samza/operators/impl/ProcessorContext.java
 PRE-CREATION 
  
samza-operator/src/main/java/org/apache/samza/operators/impl/SimpleOperatorImpl.java
 PRE-CREATION 
  
samza-operator/src/main/java/org/apache/samza/operators/impl/StateStoreImpl.java
 PRE-CREATION 
  
samza-operator/src/main/java/org/apache/samza/operators/impl/window/SessionWindowImpl.java
 PRE-CREATION 
  
samza-operator/src/main/java/org/apache/samza/task/StreamOperatorAdaptorTask.java
 PRE-CREATION 
  samza-operator/src/main/java/org/apache/samza/task/StreamOperatorTask.java 
PRE-CREATION 
  
samza-operator/src/test/java/org/apache/samza/operators/api/TestTriggerBuilder.java
 PRE-CREATION 
  samza-operator/src/test/java/org/apache/samza/operators/api/TestWindows.java 
PRE-CREATION 
  
samza-operator/src/test/java/org/apache/samza/operators/api/data/TestIncomingSystemMessage.java
 PRE-CREATION 
  
samza-operator/src/test/java/org/apache/samza/operators/api/data/TestLongOffset.java
 PRE-CREATION 
  
samza-operator/src/test/java/org/apache/samza/operators/api/internal/TestOperators.java
 PRE-CREATION 
  
samza-operator/src/test/java/org/apache/samza/operators/api/internal/TestTrigger.java
 PRE-CREATION 
  
samza-operator/src/test/java/org/apache/samza/operators/impl/TestOperatorImpl.java
 PRE-CREATION 
  
samza-operator/src/test/java/org/apache/samza/operators/impl/TestOutputMessage.java
 PRE-CREATION 
  
samza-operator/src/test/java/org/apache/samza/operators/impl/TestProcessorContext.java
 PRE-CREATION 
  
samza-operator/src/test/java/org/apache/samza/operators/impl/TestSimpleOperatorImpl.java
 PRE-CREATION 
  
samza-operator/src/test/java/org/apache/samza/operators/impl/TestSinkOperatorImpl.java
 PRE-CREATION 
  
samza-operator/src/test/java/org/apache/samza/operators/impl/TestStateStoreImpl.java
 PRE-CREATION 
  
samza-operator/src/test/java/org/apache/samza/operators/impl/window/TestSessionWindowImpl.java
 PRE-CREATION 
  samza-operator/src/test/java/org/apache/samza/task/BroadcastOperatorTask.java 
PRE-CREATION 
  
samza-operator/src/test/java/org/apache/samza/task/InputJsonSystemMessage.java 
PRE-CREATION 
  samza-operator/src/test/java/org/apache/samza/task/JoinOperatorTask.java 
PRE-CREATION 
  
samza-operator/src/test/java/org/apache/samza/task/TestStreamOperatorAdaptorTask.java
 PRE-CREATION 
  
samza-operator/src/test/java/org/apache/samza/task/TestStreamOperatorTasks.java 
PRE-CREATION 
  samza-operator/src/test/java/org/apache/samza/task/WindowOperatorTask.java 
PRE-CREATION 
  
samza-sql-calcite/src/test/java/org/apache/samza/sql/calcite/schema/TestAvroSchemaConverter.java
 PRE-CREATION 
  samza-sql-core/README.md PRE-CREATION 
  

Re: Review Request 47835: SAMZA-914: Initial draft for Java programming APIs on operators supporting DAGs

2016-10-03 Thread Yi Pan (Data Infrastructure)


> On Oct. 3, 2016, 11:11 p.m., Jagadish Venkatraman wrote:
> > samza-operator/src/main/java/org/apache/samza/operators/api/MessageStream.java,
> >  line 102
> > 
> >
> > I thought the `Operators` class was maintaining state of the topology. 
> > (and hence, that was the reason for passing in `this` so that we know the 
> > wiring of input to output operators). Is the `MessageStream` or some other 
> > component owning that wire-up?
> > 
> > wondering why we removed `this`?

Adding this here is a bit early when we are not 100% sure what would be the 
best place to save the topology info. Hence, removing for now. I am working on 
the patch for SAMZA-915 that's to wire everything up, in which I am 
re-evaluating whether we should put it in or not.


> On Oct. 3, 2016, 11:11 p.m., Jagadish Venkatraman wrote:
> > samza-operator/src/main/java/org/apache/samza/operators/api/internal/Operators.java,
> >  line 255
> > 
> >
> > nit: variable name `windowFn`. We can use `window` to refer to anything 
> > that implements a `Window`

Good point. Done.


> On Oct. 3, 2016, 11:11 p.m., Jagadish Venkatraman wrote:
> > samza-operator/src/main/java/org/apache/samza/operators/api/internal/Trigger.java,
> >  line 61
> > 
> >
> > nit: *Private* constructor to prevent instantiation.

Done.


> On Oct. 3, 2016, 11:11 p.m., Jagadish Venkatraman wrote:
> > samza-operator/src/main/java/org/apache/samza/task/StreamOperatorAdaptorTask.java,
> >  line 38
> > 
> >
> > When using with HDFS-style consumer (which is bounded), how will we 
> > handle the last batch? 
> > 
> > Shouldn't the adaptor also implement the `EndOfStreamListener` task. 
> > that will guarantee that the last batch is handled correctly? (by 
> > triggering the output corresponding to the last batch).

Not in the current scope yet. I would prefer to leave it for later evaluation, 
since a) end-of-stream is not in production yet; b) this is internal class 
which does not affect user API.


> On Oct. 3, 2016, 11:11 p.m., Jagadish Venkatraman wrote:
> > samza-operator/src/test/java/org/apache/samza/task/WindowOperatorTask.java, 
> > line 50
> > 
> >
> > 1. Wouldn't it be nicer to not have an avro dependency in our 
> > open-source examples? All our open source examples are non-avro. It seems 
> > that this example maybe easy to demo with a simple `JSON`/ a `String` based 
> > example? (It can be a follow-up)
> > 
> > Is this meant to be run as a test? Or as a show-case of the API? 
> > 
> > If it's a show-case, we should move it to samza-hello-world once we're 
> > done. (can be a follow-up)
> > 
> > 2. Also, what do you think about creating a new package namespace 
> > `examples` instead of `test`?

Good point. I will remove the avro examples and replace them w/ JSON. The 
purpose of the test cases here is both for unit test and demo. We can move them 
to examples as you suggest later.


- Yi


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/47835/#review151227
---


On Sept. 29, 2016, 2:05 a.m., Yi Pan (Data Infrastructure) wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/47835/
> ---
> 
> (Updated Sept. 29, 2016, 2:05 a.m.)
> 
> 
> Review request for samza, Boris Shkolnik, Chris Pettitt, Chinmay Soman, Jake 
> Maes, Navina Ramesh, Jagadish Venkatraman, and Xinyu Liu.
> 
> 
> Bugs: SAMZA-914
> https://issues.apache.org/jira/browse/SAMZA-914
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> SAMZA-914: initial draft of operator programming API. Design doc attached to 
> SAMZA-914: 
> https://issues.apache.org/jira/secure/attachment/12821524/SAMZA-914_%20operator%20Java%20programming%20API%20-%20Google%20Docs.pdf
> 
> 
> Diffs
> -
> 
>   build.gradle 16facbbf4dff378c561461786ff186bd9eed 
>   gradle/dependency-versions.gradle 52e25aa53a1edc85d478b48898621b26508ad4bb 
>   samza-api/src/test/java/org/apache/samza/config/TestConfig.java 
> 5d066c5867e9df9e94e60bde825dedf10703b399 
>   
> samza-operator/src/main/java/org/apache/samza/operators/api/MessageStream.java
>  PRE-CREATION 
>   
> samza-operator/src/main/java/org/apache/samza/operators/api/MessageStreams.java
>  PRE-CREATION 
>   
> samza-operator/src/main/java/org/apache/samza/operators/api/TriggerBuilder.java
>  

Re: [DISCUSS] [VOTE] Apache Samza 0.11.0 RC0

2016-10-03 Thread Jagadish Venkatraman
+1 from my side for the release (non-binding)

On Mon, Oct 3, 2016 at 12:36 PM, Boris Shkolnik  wrote:

> +1
>
> On Fri, Sep 30, 2016 at 1:39 PM, xinyu liu  wrote:
>
> > Subject correction: [VOTE] Apache Samza 0.11.0 RC0.
> >
> > Thanks,
> > Xinyu
> >
> > On Fri, Sep 30, 2016 at 12:00 PM, xinyu liu 
> wrote:
> >
> > > Hey all,
> > >
> > > This is a call for a vote on a release of Apache Samza 0.11.0. Thanks
> to
> > > everyone who has contributed to this release. We are very glad to see
> > > some new contributors in this release.
> > >
> > > The release candidate can be downloaded from here:
> > > http://home.apache.org/~xinyu/samza-0.11.0-rc0/
> > >
> > > The release candidate is signed with pgp key C31D7061, which can be
> > found on
> > > keyservers:
> > > http://pgp.mit.edu/pks/lookup?op=get=0xC31D7061
> > >
> > > The git tag is release-0.11.0-rc0 and signed with the same pgp key:
> > > https://git-wip-us.apache.org/repos/asf?p=samza.git;a=tag;h=
> > > refs/tags/release-0.11.0-rc0
> > >
> > > Test binaries have been published to Maven's staging repository, and
> are
> > > available here:
> > > https://repository.apache.org/content/repositories/orgapachesamza-1013
> > >
> > > Note that the binaries were built with JDK7 without incident.
> > >
> > > 38 issues were resolved for this release:
> > > https://issues.apache.org/jira/issues/?jql=project%20%3D%
> > > 20SAMZA%20AND%20fixVersion%20in%20(0.11%2C%200.11.0)%20AND%
> > > 20status%20in%20(Resolved%2C%20Closed)
> > >
> > > The vote will be open for 72 hours ( end in 12:00pm Wednesday,
> 10/05/2016
> > > ).
> > >
> > > Please download the release candidate, check the hashes/signature,
> build
> > > it and test it, and then please vote:
> > >
> > >
> > > [ ] +1 approve
> > >
> > > [ ] +0 no opinion
> > >
> > > [ ] -1 disapprove (and reason why)
> > >
> > >
> > > +1 from my side for the release.
> > >
> > > Cheers!
> > > Xinyu Liu
> > >
> >
>



-- 
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University


Re: Review Request 47835: SAMZA-914: Initial draft for Java programming APIs on operators supporting DAGs

2016-10-03 Thread Jagadish Venkatraman

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/47835/#review151227
---



Overall, this is looking pretty good! Just some minor comments.


samza-operator/src/main/java/org/apache/samza/operators/api/MessageStream.java 
(line 84)


I thought the `Operators` class was maintaining state of the topology. (and 
hence, that was the reason for passing in `this` so that we know the wiring of 
input to output operators). Is the `MessageStream` or some other component 
owning that wire-up?

wondering why we removed `this`?



samza-operator/src/main/java/org/apache/samza/operators/api/internal/Operators.java
 (line 253)


nit: variable name `windowFn`. We can use `window` to refer to anything 
that implements a `Window`



samza-operator/src/main/java/org/apache/samza/operators/api/internal/Trigger.java
 (line 61)


nit: *Private* constructor to prevent instantiation.



samza-operator/src/main/java/org/apache/samza/task/StreamOperatorAdaptorTask.java
 (line 38)


When using with HDFS-style consumer (which is bounded), how will we handle 
the last batch? 

Shouldn't the adaptor also implement the `EndOfStreamListener` task. that 
will guarantee that the last batch is handled correctly? (by triggering the 
output corresponding to the last batch).



samza-operator/src/test/java/org/apache/samza/task/WindowOperatorTask.java 
(line 49)


1. Wouldn't it be nicer to not have an avro dependency in our open-source 
examples? All our open source examples are non-avro. It seems that this example 
maybe easy to demo with a simple `JSON`/ a `String` based example? (It can be a 
follow-up)

Is this meant to be run as a test? Or as a show-case of the API? 

If it's a show-case, we should move it to samza-hello-world once we're 
done. (can be a follow-up)

2. Also, what do you think about creating a new package namespace 
`examples` instead of `test`?


- Jagadish Venkatraman


On Sept. 29, 2016, 2:05 a.m., Yi Pan (Data Infrastructure) wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/47835/
> ---
> 
> (Updated Sept. 29, 2016, 2:05 a.m.)
> 
> 
> Review request for samza, Boris Shkolnik, Chris Pettitt, Chinmay Soman, Jake 
> Maes, Navina Ramesh, Jagadish Venkatraman, and Xinyu Liu.
> 
> 
> Bugs: SAMZA-914
> https://issues.apache.org/jira/browse/SAMZA-914
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> SAMZA-914: initial draft of operator programming API. Design doc attached to 
> SAMZA-914: 
> https://issues.apache.org/jira/secure/attachment/12821524/SAMZA-914_%20operator%20Java%20programming%20API%20-%20Google%20Docs.pdf
> 
> 
> Diffs
> -
> 
>   build.gradle 16facbbf4dff378c561461786ff186bd9eed 
>   gradle/dependency-versions.gradle 52e25aa53a1edc85d478b48898621b26508ad4bb 
>   samza-api/src/test/java/org/apache/samza/config/TestConfig.java 
> 5d066c5867e9df9e94e60bde825dedf10703b399 
>   
> samza-operator/src/main/java/org/apache/samza/operators/api/MessageStream.java
>  PRE-CREATION 
>   
> samza-operator/src/main/java/org/apache/samza/operators/api/MessageStreams.java
>  PRE-CREATION 
>   
> samza-operator/src/main/java/org/apache/samza/operators/api/TriggerBuilder.java
>  PRE-CREATION 
>   
> samza-operator/src/main/java/org/apache/samza/operators/api/WindowState.java 
> PRE-CREATION 
>   samza-operator/src/main/java/org/apache/samza/operators/api/Windows.java 
> PRE-CREATION 
>   
> samza-operator/src/main/java/org/apache/samza/operators/api/data/Message.java 
> PRE-CREATION 
>   
> samza-operator/src/main/java/org/apache/samza/operators/api/internal/Operators.java
>  PRE-CREATION 
>   
> samza-operator/src/main/java/org/apache/samza/operators/api/internal/Trigger.java
>  PRE-CREATION 
>   
> samza-operator/src/main/java/org/apache/samza/operators/api/internal/WindowFn.java
>  PRE-CREATION 
>   
> samza-operator/src/main/java/org/apache/samza/operators/api/internal/WindowOutput.java
>  PRE-CREATION 
>   
> samza-operator/src/main/java/org/apache/samza/operators/impl/ChainedOperators.java
>  PRE-CREATION 
>   
> samza-operator/src/main/java/org/apache/samza/operators/impl/OperatorImpl.java
>  PRE-CREATION 
>   
> samza-operator/src/main/java/org/apache/samza/operators/impl/ProcessorContext.java
>  PRE-CREATION 
>   
> samza-operator/src/main/java/org/apache/samza/operators/impl/SimpleOperatorImpl.java
>  PRE-CREATION 
>   
> 

Re: [DISCUSS] [VOTE] Apache Samza 0.11.0 RC0

2016-10-03 Thread Boris Shkolnik
+1

On Fri, Sep 30, 2016 at 1:39 PM, xinyu liu  wrote:

> Subject correction: [VOTE] Apache Samza 0.11.0 RC0.
>
> Thanks,
> Xinyu
>
> On Fri, Sep 30, 2016 at 12:00 PM, xinyu liu  wrote:
>
> > Hey all,
> >
> > This is a call for a vote on a release of Apache Samza 0.11.0. Thanks to
> > everyone who has contributed to this release. We are very glad to see
> > some new contributors in this release.
> >
> > The release candidate can be downloaded from here:
> > http://home.apache.org/~xinyu/samza-0.11.0-rc0/
> >
> > The release candidate is signed with pgp key C31D7061, which can be
> found on
> > keyservers:
> > http://pgp.mit.edu/pks/lookup?op=get=0xC31D7061
> >
> > The git tag is release-0.11.0-rc0 and signed with the same pgp key:
> > https://git-wip-us.apache.org/repos/asf?p=samza.git;a=tag;h=
> > refs/tags/release-0.11.0-rc0
> >
> > Test binaries have been published to Maven's staging repository, and are
> > available here:
> > https://repository.apache.org/content/repositories/orgapachesamza-1013
> >
> > Note that the binaries were built with JDK7 without incident.
> >
> > 38 issues were resolved for this release:
> > https://issues.apache.org/jira/issues/?jql=project%20%3D%
> > 20SAMZA%20AND%20fixVersion%20in%20(0.11%2C%200.11.0)%20AND%
> > 20status%20in%20(Resolved%2C%20Closed)
> >
> > The vote will be open for 72 hours ( end in 12:00pm Wednesday, 10/05/2016
> > ).
> >
> > Please download the release candidate, check the hashes/signature, build
> > it and test it, and then please vote:
> >
> >
> > [ ] +1 approve
> >
> > [ ] +0 no opinion
> >
> > [ ] -1 disapprove (and reason why)
> >
> >
> > +1 from my side for the release.
> >
> > Cheers!
> > Xinyu Liu
> >
>


Re: [Discuss] Moving Samza to Java 1.8 source compatibility.

2016-10-03 Thread Boris Shkolnik
+1 for moving to 1.8.

On Wed, Apr 27, 2016 at 6:27 PM, Jacob Maes  wrote:

> Hey everyone,
>
> I wanted to start a discussion to see what folks think about moving to Java
> 1.8 source compatibility at some point after the 10.1 release.
>
> Java 8 has a number of nice features that can help us build more concise,
> maintainable, and robust software. A few notable features that would
> benefit Samza:
> 1. Stream API - provide a compact syntax for expressing transformations on
> collections. These may be foundational for future API work including
> Operators (SAMZA-914)
> 2. Default Methods - enable us to evolve interfaces without breaking
> compatibility
> 3. Concurrent package enhancements - generally make concurrent programming
> easier, which will be more important with features like multithreading
> support (SAMZA-863)
> 4. Lambdas - love them or hate them, they do reduce the amount of
> boilerplate code, especially when used in place of anonymous classes.
>
> It certainly would be nice to leverage some of the features above. However,
> we have historically supported Java versions N and N-1 and it doesn't look
> like Java 9 is coming until next year. So, discontinuing support for Java
> 1.7 at this point would be a departure from our normal support matrix for a
> significant period of time. Thoughts on the pros and cons?
>
> I know some folks in this community are still on Java 1.7. How many of you
> stay up to date with the latest Samza? Do you have a roadmap to move to
> Java 1.8?
>
> Thanks,
> Jake
>


Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-10-03 Thread Yi Pan (Data Infrastructure)


> On Sept. 29, 2016, 10:02 p.m., Yi Pan (Data Infrastructure) wrote:
> > samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/HdfsConfig.scala, 
> > line 197
> > 
> >
> > Thinking of this more, I would prefer less dependency imposed between 
> > samza-yarn and samza-hdfs modules. Thinking of a case where HDFS consumer 
> > is used by a standalone Samza job, there is no YarnConfig object in the 
> > job. I think we should make this as required config for HdfsSystemConsumer, 
> > just like ZooKeeper connnect string is required for KafkaSystemConsumer.
> > 
> > Also, under which condition we need to clear the partition descriptor 
> > info in the staging dir? We need to think about the cleanup procedure as 
> > well.
> 
> Hai Lu wrote:
> We need to remove partition descriptors when job is done. Not doing so 
> would end up spamming user's HDFS space, causing immediate troubles to our 
> users. 
> 
> But right now there is no way that HdfsSystemConsumer/Admin would know 
> when the job is shutdown. So I don't see there is a solution if we don't 
> directly/indirectly depend on YARN, since only the YARN codes have this idea 
> of staging directory, and actually clean up the directory at the end of the 
> job.  I think what we really need to do, long term, is to support staging 
> direcotry in the Samza level, so that in addition to YARN, other platforms 
> like Docker, Mesos, Standalone can work as well.
> 
> Plus we have to keep in mind that only YARN has the kerberos support for 
> now. So currently HDFS systems ARE depending on YARN in that sense. Security 
> is one more thing to deal with (aside from staging directory) before we can 
> say HDFS sytems no long depends on YARN.
> 
> What do you think? I will keep this issue open.

There are two different levels of dependencies here: a) code-level dependency 
that means the HdfsSystemConsumer code depends directly on samza-yarn classes; 
b) config/semantic dependency that means some expected behavior of a certain 
function (i.e. cleanup) depends on other modules. I would prefer to remove the 
code-level dependency from the beginning. We can still set the configuration of 
HdfsSystemConsumer to use the same staging directory configuration from 
samza-yarn to achieve the cleanup function. This means that HdfsSystemConsumer 
itself does not support after-completion cleanup yet and depends on samza-yarn 
to clean up. It is a configure-level dependency and we have the freedom to 
remove this w/o code change when either a) HdfsSystemConsumer can cleanup the 
staging directory after end-of-stream; b) staging directory config is moved to 
samza-core. Thoughts?


- Yi


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/#review150949
---


On Oct. 3, 2016, 4:54 p.m., Hai Lu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51142/
> ---
> 
> (Updated Oct. 3, 2016, 4:54 p.m.)
> 
> 
> Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.
> 
> 
> Bugs: SAMZA-967
> https://issues.apache.org/jira/browse/SAMZA-967
> 
> 
> Repository: samza
> 
> 
> Description
> ---
> 
> Add HDFS System Consumer: 
> 
> 1. System admin, partitioner
> 2. System consumer with metrics
> 
> Design doc can be found here: 
> https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf
> 
> An overview of the high level architecture: 
> 
> The system factory is used by Samza to instantiate SystemConsumer, 
> SystemProducer, and SystemAdmin for a specific system. The 
> FileDataSystemFactory can be reused for other file system like sources. 
> 
> HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
> HDFS files need to be consumed for this job. The DirectoryPartitioner also 
> uses “GroupingPattern” to group files into partitions if advanced 
> partitioning is required. HDFSSystemAdmin will then persist the 
> “PartitionDescriptor” to HDFS.
> 
> The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
> Based on this information as well as the actual assignment of partitions, it 
> would then know which files to read from.
> 
> The initial implementation of the HDFS system consumer supports only avro 
> data files. It’s very easy to extend it to a variety of file format by 
> implementing the FileReader interface.
> 
>   
> 
>  
> +--+
>  

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-10-03 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Oct. 3, 2016, 4:54 p.m.)


Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.


Changes
---

End of stream support merge


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics

Design doc can be found here: 
https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf

An overview of the high level architecture: 

The system factory is used by Samza to instantiate SystemConsumer, 
SystemProducer, and SystemAdmin for a specific system. The 
FileDataSystemFactory can be reused for other file system like sources. 

HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
HDFS files need to be consumed for this job. The DirectoryPartitioner also uses 
“GroupingPattern” to group files into partitions if advanced partitioning is 
required. HDFSSystemAdmin will then persist the “PartitionDescriptor” to HDFS.

The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
Based on this information as well as the actual assignment of partitions, it 
would then know which files to read from.

The initial implementation of the HDFS system consumer supports only avro data 
files. It’s very easy to extend it to a variety of file format by implementing 
the FileReader interface.


  
 
+--+
 
 |  
| 
   +-+ HDFS 
| 
   |   Obtain|  
| 
   |  Partition  
+--+--^--+-^---+
 
   | Description|  |  | 
| 
   ||  |  | 
| 
   |  +-v---+  |  |   
Filtering/| 
   |  | |  |  +---+
Grouping +-+   
   |  | HDFSAvroFileReader  |  |  | 
  |   
   |  | |Persist   |  | 
  |   
   |  +-+---+   Partition  |  | 
  |   
   ||  Description |   
+--v--+ +--+--+
   ||  |   |
 | | |
   |  +-+---+  |   |Directory 
Partitioner| |   HDFSAvroWriter|
   |  | IFileReader |  |   |
 | | |
   |  | |  |   
+--+--+ +--+--+
   |  +-+---+  |  | 
  |   
   ||  |  | 
  |   
   ||  |  | 
  |   
   |  +-+---+
+-+--++   +--+--+
   |  | || 
|   | |
   |  | HDFSSystemConsumer  ||   HDFSSystemAdmin   
|   | HDFSSystemProducer  |
   +--> || 
|   | |
  +-+---+
+---+-+   +--+--+
||  
  |   

+++ 

Re: Review Request 51142: SAMZA-967: HDFS System Consumer

2016-10-03 Thread Hai Lu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51142/
---

(Updated Oct. 3, 2016, 4:04 p.m.)


Review request for samza, Yi Pan (Data Infrastructure) and Navina Ramesh.


Bugs: SAMZA-967
https://issues.apache.org/jira/browse/SAMZA-967


Repository: samza


Description
---

Add HDFS System Consumer: 

1. System admin, partitioner
2. System consumer with metrics

Design doc can be found here: 
https://issues.apache.org/jira/secure/attachment/12824078/HDFSSystemConsumer.pdf

An overview of the high level architecture: 

The system factory is used by Samza to instantiate SystemConsumer, 
SystemProducer, and SystemAdmin for a specific system. The 
FileDataSystemFactory can be reused for other file system like sources. 

HDFSSystemAdmin will start a “DirectoryPartitioner” to figure out the set of 
HDFS files need to be consumed for this job. The DirectoryPartitioner also uses 
“GroupingPattern” to group files into partitions if advanced partitioning is 
required. HDFSSystemAdmin will then persist the “PartitionDescriptor” to HDFS.

The HDFSSystemConsumer will then pick up the “PartitionDescriptor” from HDFS. 
Based on this information as well as the actual assignment of partitions, it 
would then know which files to read from.

The initial implementation of the HDFS system consumer supports only avro data 
files. It’s very easy to extend it to a variety of file format by implementing 
the FileReader interface.


  
 
+--+
 
 |  
| 
   +-+ HDFS 
| 
   |   Obtain|  
| 
   |  Partition  
+--+--^--+-^---+
 
   | Description|  |  | 
| 
   ||  |  | 
| 
   |  +-v---+  |  |   
Filtering/| 
   |  | |  |  +---+
Grouping +-+   
   |  | HDFSAvroFileReader  |  |  | 
  |   
   |  | |Persist   |  | 
  |   
   |  +-+---+   Partition  |  | 
  |   
   ||  Description |   
+--v--+ +--+--+
   ||  |   |
 | | |
   |  +-+---+  |   |Directory 
Partitioner| |   HDFSAvroWriter|
   |  | IFileReader |  |   |
 | | |
   |  | |  |   
+--+--+ +--+--+
   |  +-+---+  |  | 
  |   
   ||  |  | 
  |   
   ||  |  | 
  |   
   |  +-+---+
+-+--++   +--+--+
   |  | || 
|   | |
   |  | HDFSSystemConsumer  ||   HDFSSystemAdmin   
|   | HDFSSystemProducer  |
   +--> || 
|   | |
  +-+---+
+---+-+   +--+--+
||  
  |   

+++