Re: Question about sub-projects and project merging

2015-07-13 Thread Greg Stein
Hi Jay,

Looking at your question, I see the Apache Samza and Apache Kafka
*communities* have little overlap(*). The Board looks at communities, and
their overlap or lack thereof. Smushing two communities under one TLP is
what we have historically called an umbrella TLP, and we discourage it.
Communities should be allowed to operate independently.

If you have *one* community, then one TLP makes sense.

If you have *two* communities, then increase the overlap. When they look
like one community, and that one community votes to merge TLPs ... then ask
for that.

Cheers,
-g

(*) 2 common PMC members, 3 common committers.


On Mon, Jul 13, 2015 at 12:37 AM, Jay Kreps jay.kr...@gmail.com wrote:

 Hey board members,

 There is a longish thread on the Apache Samza mailing list on the
 relationship between Kafka and Samza and whether they wouldn't make a lot
 more sense as a single project. This raised some questions I was hoping to
 get advice on.

 Discussion thread (warning: super long, I attempt to summarize relevant
 bits below):

 http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3ccabyby7d_-jcxj7fizsjuebjedgbep33flyx3nrozt0yeox9...@mail.gmail.com%3E

 Anyhow, some people thought that since Apache has lots of sub-projects, that
 would be a graceful way to step in the right direction. At that point others
 popped up and said sub-projects are discouraged by the board.

 I'm not sure if we understand technically what a subproject is, but I
 think it means a second repo/committership under the same PMC.

 A few questions:
 - Is that what a sub-project is?
 - Are they discouraged? If so, why?
 - Assuming it makes sense in this case what is the process for making one?
 - Putting aside sub-projects as a mechanism what are examples where
 communities merged successfully? We were pointed towards Lucene/SOLR. Are
 there others?

 Relevant background info:
 - Samza depends on Kafka, but not vice versa
 - There is some overlap in committers but not extensive (3/11 Samza
 committers are also Kafka committers)

 Thanks for the advice!

 -Jay






Re: Thoughts and obesrvations on Samza

2015-07-13 Thread Yan Fang
I am leaning toward Jay's fifth approach. It is not radical and gives us some
time to see the outcome.

In addition, I would suggest:

1) Keep the SystemConsumer/SystemProducer API. The current
SystemConsumer/SystemProducer API satisfies the usage (per Jordan's, and
even Garry's, feedback) and is not so broken that we want to deprecate it.
Though there are some issues in implementing Kinesis support, they are not
unfixable. Nothing should prevent Samza, as a stream processing system, from
supporting other systems. In addition, there are already some systems
existing besides Kafka: ElasticSearch (committed to master), HDFS
(patch available), S3 (from the mailing list), Kinesis (being developed in
another repository), ActiveMQ (in two months). We may want to see how those
go before we kill them.

2) Have some Samza devs involved in Kafka's transformer client API.
This would not only make any future integration much easier, because
they have knowledge of both systems, but would also be good for Kafka's
community, because Samza devs have stream processing experience that
Kafka devs may lack.

3) Samza's partition management system may still support other systems.
Though the partition management logic in samza-kafka will be moved to
Kafka, it is still useful for other systems that do not have a partition
management layer.

4) Start sharing the docs/websites and using the same terminology (though
I do not know exactly how to do this :). This will reduce future
confusion and does not hurt Samza's independence.

In my opinion, Samza, as a standalone project, can still rely heavily on
Kafka (it already does) and be tuned even more for Kafka-specific usage.
Kafka can also include Samza in its documentation; I do not see anything
preventing this.

Thanks,

Fang, Yan
yanfang...@gmail.com

Re: Question about sub-projects and project merging

2015-07-13 Thread Jay Kreps
Hey Mike,

Thanks for sharing, it is helpful to hear the experience that leads to
these recommendations.

-Jay

On Mon, Jul 13, 2015 at 11:01 AM, Mike Kienenberger mkien...@gmail.com
wrote:

 A subproject is one of many projects that fall under the same umbrella
 project management committee (PMC).   It doesn't have to be a separate
 repo, but it generally has a separate community or a subset of the
 full community.

 Speaking as a long-time PMC member for MyFaces, our problem with
 subprojects (we have 11!) is that it's hard to keep accountability and
 monitor community health.

 A subproject starts off being active with some subset of the community,
 but then reduces activity at some future point.   Those who aren't
 directly involved with the subproject tend not to notice that the
 particular subproject has fallen to unhealthy levels.   Generally, you
 don't realize something is wrong until after all of the developers
 have left when you suddenly realize that there's no one answering
 questions, applying patches, or familiar with the code base.

 Non-umbrella projects that report to the board are expected to evaluate
 community health each quarter.   Umbrella projects are also supposed
 to do this, but often fail to realize that community health has to be
 individually evaluated for each subproject each quarter.   The PMC
 chair is likely not directly involved with each subproject, and may
 not be in a good position to evaluate the sub-project's health.  As
 Hervé mentions, this is particularly true for TLPs which have a main
 project and optional modules where everyone cares about the main
 project and only a few care about each module subproject.   This is
 what happened with MyFaces.

 What tends to happen with umbrella projects is that you end up
 creating two-tier project management.  Those responsible to the board
 are upper management but may not be directly involved and fail to
 understand the subproject community health.  Those who are supposed to
 actively manage the project are lower management and are not
 directly responsible to the board for quarterly reports.

 Best practice is to have a one-tier PMC.  As soon as a subproject is
 healthy enough to stand on its own, it probably should go TLP.
 MyFaces successfully spun off DeltaSpike, and DeltaSpike remains
 healthy.  The other alternative is to be certain to address the status
 of each subproject in the board report, much like the Incubator board
 report does each time.

 My advice is the same as others' -- keep the two projects separate, but
 encourage individual Samza committers to join as Kafka committers if they
 feel the need to do so.



Re: Thoughts and obesrvations on Samza

2015-07-13 Thread Jordan Shaw
Jay,
I think doing this iteratively in smaller chunks is a better way to go as
new issues arise. As Navina said Kafka is a stream system and Samza is a
stream processor and those two ideas should be mutually exclusive.

-Jordan


Re: Thoughts and obesrvations on Samza

2015-07-13 Thread Jay Kreps
Hmm, thought about this more. Maybe this is just too much too quick.
Overall I think there is some enthusiasm for the proposal but it's not
really unanimous enough to make any kind of change this big cleanly. The
board doesn't really like the merging stuff, users are concerned about
compatibility, I didn't feel there was unanimous agreement on dropping
SystemConsumer, etc. Even if this is the right end state to get to,
probably trying to push all this through at once isn't the right way to do
it.

So let me propose a kind of fifth (?) option which I think is less dramatic
and lets things happen gradually. I think this is kind of like combining
the first part of Yi's proposal and Jakob's third option, leaving the rest
to be figured out incrementally:

Option 5: We continue the prototype I shared and propose that as a kind of
transformer client API in Kafka. This isn't really a full-fledged stream
processing layer, more like a souped-up consumer API for munging topics.
This would let us figure out some of the technical bits, how to do this on
Kafka's group management features, how to integrate the txn feature to do
the exactly-once stuff in these transformations, and get all this stuff
solid. This API would have valid uses in its own right, especially when
your transformation will be embedded inside an existing service or
application, which isn't possible with Samza (or other existing systems that
I know of).
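
To make the idea of a consumer API for munging topics a bit more concrete,
here is a minimal sketch in Java. It is purely illustrative:
TopicTransformerSketch, Transformer and runOnce are hypothetical names, the
"topics" are stood in for by in-memory lists, and none of this is the
prototype or an actual Kafka API. It only shows the shape of a per-record
transformation that could be embedded inside an existing application.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: not the prototype and not a real Kafka API.
final class TopicTransformerSketch {

    // A transformation maps one input record to zero or more output records.
    interface Transformer<I, O> {
        List<O> apply(I record);
    }

    // Stand-in for one consume/transform/produce cycle; the "topics" are plain lists.
    static <I, O> List<O> runOnce(List<I> inputTopic, Transformer<I, O> transformer) {
        List<O> outputTopic = new ArrayList<>();
        for (I record : inputTopic) {
            outputTopic.addAll(transformer.apply(record));
        }
        return outputTopic;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b", "c");
        // Split each line into words, as an example of munging a topic.
        List<String> words = runOnce(lines, line -> Arrays.asList(line.split(" ")));
        System.out.println(words);
    }
}
```

Group management, offsets and exactly-once delivery are deliberately left
out of the sketch; those are the technical bits that would still need to be
figured out.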

Independently we can iterate on some of the ideas of the original proposal
individually and figure out how (if at all) to make use of this
functionality. This can be done bit-by-bit:
- Could be that the existing StreamTask API ends up wrapping this
- Could end up exposed directly in Samza as Yi proposed
- Could be that just the lower-level group-management stuff gets used, and
in this case it could be either just for standalone mode, or always
- Could be that it stays as-is

The advantage of this is it is lower risk...we basically don't have to make
12 major decisions all at once that kind of hinge on what amounts to a
pretty aggressive rewrite. The disadvantage of this is it is a bit more
confusing as all this is getting figured out.

As with some of the other stuff, this would require a further discussion in
the Kafka community if people do like this approach.

Thoughts?

-Jay




On Sun, Jul 12, 2015 at 10:52 PM, Jay Kreps jay.kr...@gmail.com wrote:

 Hey Chris,

 Yeah, I'm obviously in favor of this.

 The sub-project approach seems the ideal way to take a graceful step in
 this direction, so I will ping the board folks and see why they are
 discouraged; it would be good to understand that. If we go that route we
 would need to have a similar discussion on the Kafka list (but it makes sense
 to figure out first if it is what Samza wants).

 Irrespective of how it's implemented, though, to me the important things
 are the following:
 1. Unify the website, config, naming, docs, metrics, etc--basically fix
 the product experience so the stream and the processing feel like a
 single user experience and brand. This seems minor but I think is a really
 big deal.
 2. Make standalone mode a first class citizen and have a real technical
 plan to be able to support cluster managers other than YARN.
 3. Make the config and out-of-the-box experience more usable

 I think that prototype gives a practical example of how 1-3 could be done
 and we should pursue it. This is a pretty radical change, so I wouldn't be
 shocked if people didn't want to take a step like that.

 Maybe it would make sense to see if people are on board with that general
 idea, and then try to get some advice on sub-projects in parallel and nail
 down those details?

 -Jay

 On Sun, Jul 12, 2015 at 5:54 PM, Chris Riccomini criccom...@apache.org
 wrote:

 Hey all,

 I want to start by saying that I'm absolutely thrilled to be a part of this
 community. The amount of level-headed, thoughtful, educated discussion
 that's gone on over the past ~10 days is overwhelming. Wonderful.

 It seems like discussion is waning a bit, and we've reached some
 conclusions. There are several key emails in this thread, which I want to
 call out:

 1. Jakob's summary of the three potential ways forward.
 http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVu-hxdBfyQ4qm3LDC55cUQbPdmbe4zGzTOOatYF1Pz43A%40mail.gmail.com%3E
 2. Julian's call out that we should be focusing on community over code.
 http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCAPSgeESZ_7bVFbwN%2Bzqi5MH%3D4CWu9MZUSanKg0-1woMqt55Fvg%40mail.gmail.com%3E
 3. Martin's summary about the benefits of merging communities.
 http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CBFB866B6-D9D8-4578-93C0-FFAEB1DF00FC%40kleppmann.com%3E
 4. Jakob's comments about the distinction between community and code paths.
 http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVtWPjHLLDsmxvz9KggVA5DfBi-nUvfqB6QdA-du%2B_a9Ng%40mail.gmail.com%3E

 I 

Re: Review Request 36274: SAMZA-401: getCpuTime to truly calculate duty cycle of the event loop

2015-07-13 Thread Yan Fang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36274/#review91527
---



samza-core/src/main/scala/org/apache/samza/container/RunLoop.scala (lines 75 - 76)
https://reviews.apache.org/r/36274/#comment144953

For better accuracy, I think activeNs should go before totalNs.
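
One reading of the ordering point, as a minimal sketch (written in Java for
brevity; this is not the RunLoop.scala code under review, and
DutyCycleSketch, utilization and measure are hypothetical names): the active
time is captured right after the work finishes, and the total time is
captured last, so the total always covers the active interval.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch; not the actual RunLoop.scala code from the review.
final class DutyCycleSketch {

    // Fraction of one loop iteration spent doing work, clamped to [0, 1].
    static double utilization(long activeNs, long totalNs) {
        if (totalNs <= 0) {
            return 0.0;
        }
        return Math.max(0.0, Math.min(1.0, (double) activeNs / totalNs));
    }

    // Runs one iteration (work, then idle) against an injectable clock.
    static double measure(LongSupplier clock, Runnable work, Runnable idle) {
        long start = clock.getAsLong();
        work.run();
        long activeNs = clock.getAsLong() - start; // captured first
        idle.run();
        long totalNs = clock.getAsLong() - start;  // captured last
        return utilization(activeNs, totalNs);
    }

    public static void main(String[] args) {
        double u = measure(System::nanoTime, () -> { }, () -> { });
        System.out.println("utilization = " + u);
    }
}
```

Passing the clock in as a parameter is only there to make the calculation
easy to unit test with fixed timestamps.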


- Yan Fang


On July 7, 2015, 7:08 p.m., Luis De Pombo wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/36274/
 ---
 
 (Updated July 7, 2015, 7:08 p.m.)
 
 
 Review request for samza.
 
 
 Repository: samza
 
 
 Description
 ---
 
 SAMZA-401: getCpuTime to truly calculate duty cycle of the event loop
 
 
 Diffs
 ---
 
   samza-core/src/main/scala/org/apache/samza/container/RunLoop.scala c292ae47cd89ef0f25dc682c02dd288e2ba6dcc5
   samza-core/src/main/scala/org/apache/samza/util/TimerUtils.scala 1643070dd710efb9ade9eb5812dabd6fa60ce023
   samza-core/src/main/scala/org/apache/samza/util/Util.scala 2feb65b729b45fbc3b83a75c4072527e3c4e60be
   samza-core/src/test/scala/org/apache/samza/container/TestRunLoop.scala 64a5844bdb343a3c509cba059b9f3b9a19dc9eff
 
 Diff: https://reviews.apache.org/r/36274/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Luis De Pombo
 




You're invited to an Apache Samza meetup hosted at LinkedIn on July 21 @ 6PM

2015-07-13 Thread Ed Yakabosky
Hi all -

The Samza development team invites you to join us at an Apache Samza
(http://samza.apache.org/) meetup hosted in the Unite conference room at
LinkedIn's Mountain View campus on Tuesday, July 21 at 6PM.  Food/drinks and
streaming video will be provided.  Please RSVP here
(http://www.meetup.com/Bay-Area-Samza-Meetup/events/223768847/) if you plan
to attend the event in person.

Here are two topics we’ll be covering.  We’ll probably have a 3rd topic which 
is still under discussion.


Harvesting the Power of Samza in LinkedIn's Feed - LinkedIn's Feed is the entry 
point for hundreds of millions of members who seek to stay informed about their 
professional interests. The feed strives to provide relevant content to members 
that's also new and fresh. How does the feed solve this problem at scale? What 
role does Samza play in this? Join us to find out.

Athena - Stream processing platform @ Uber - We present Athena - a stream
processing platform for Uber's near real time analytics applications, built 
using Samza. We will be discussing some of the existing and upcoming use cases 
and how they impact the Uber partners / riders. We will be talking about the 
tooling built around Samza for easier user on-boarding - such as deployment 
manager, integration with typesafe config system, unit test framework, Graphite 
integration, metric whitelisting and so on. We'll also go over some of the 
issues observed during this process.

Hope to see you there!
Ed Yakabosky (TPM, Samza @ LinkedIn)


Re: Thoughts and obesrvations on Samza

2015-07-13 Thread Yi Pan
Hi, Garry,

Just want to chime in to share our experience at LinkedIn. At LinkedIn, we
have a lot of aggregation/transformation stream processing jobs that fall
into the transformation category. That's also the motivation for us to
develop the SQL layer on top of streams, to provide an easy programming model
for data transformation on streams. Ingestion from a wide range of sources and
egress to some serving tier are important, but I would argue that w/o the
transformation in between, there is not much value added by stream
processing.

Just my 2 cents.

On Mon, Jul 13, 2015 at 12:56 PM, Garry Turkington 
g.turking...@improvedigital.com wrote:

 Hi,

 I'm also supportive of Jay's option 5. There is a risk the transformer
 API -- I'd have preferred Metamorphosis but it's too hard to type! --
 takes on a life of its own and we end up with two very different things but
 given how good the Kafka community has been at introducing new producer and
 consumer clients and giving very clear guidance on when they are production
 ready this is a danger I believe can be managed. It'd also be excellent to
 get some working code to beat around the notions of stream processing atop
 a system with transactional messages.

 On the question of whether to keep or deprecate SystemConsumer/Producer I
 believe we need to get a better understanding over the next while of just what
 the Samza community is looking for in such connectivity. For my own use
 cases I have been looking to add additional implementations primarily to
 use Samza as the data ingress and egress component around Kafka. Writing
 external clients that require their own reliability and scalability
 management gets old real fast and pushing this into a simple Samza job that
 reads from system X and pushes into Kafka (or vice versa) was the obvious
 choice for me in the current model. For this type of usage though copycat
 is likely much superior (obviously this needs to be proven) and the question then is
 if most Samza users look to the system implementations to also act as a
 front-end into Kafka or if significant usage is indeed intended to have the
 alternative systems as the primary message source. That understanding will
 I think give much clarity in just what value the abstraction overhead of
 the current model brings.

 Garry


Re: Review Request 36089: SAMZA-670 Allow easier access to JMX port

2015-07-13 Thread Yan Fang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36089/#review91533
---



samza-core/src/main/scala/org/apache/samza/container/SamzaContainer.scala (line 634)
https://reviews.apache.org/r/36089/#comment144956

I think a better way, which requires far fewer changes, is to call
something like jmxServer.getJmxUrl, jmxServer.jmxTunnelingUrl.

jmxServer can be a variable of the SamzaContainer object.

Then we do not need to change ContainerModel, JobModel, SamzaContext,
because there is no reason to put JMX information into those
three objects.
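
A rough sketch of that shape, in Java and with hypothetical names throughout
(JmxServerSketch, ContainerSketch, getTunnelingUrl and jmxSummary are
made up for illustration; the real JmxServer and SamzaContainer classes may
look quite different). The container holds the JMX server object and asks it
for its URLs, so the model classes do not need to carry any JMX fields. The
URL strings follow the usual JMX-over-RMI service URL form.

```java
// Hypothetical sketch; the real JmxServer and SamzaContainer may differ.
final class JmxServerSketch {
    private final String host;
    private final int registryPort;
    private final int serverPort;

    JmxServerSketch(String host, int registryPort, int serverPort) {
        this.host = host;
        this.registryPort = registryPort;
        this.serverPort = serverPort;
    }

    // Plain JMX-over-RMI service URL that looks up the connector in the RMI registry.
    String getJmxUrl() {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + registryPort + "/jmxrmi";
    }

    // Variant that also pins the RMI server port, which helps when tunneling.
    String getTunnelingUrl() {
        return "service:jmx:rmi://" + host + ":" + serverPort
            + "/jndi/rmi://" + host + ":" + registryPort + "/jmxrmi";
    }
}

// The container keeps a reference to the server instead of copying URLs into models.
final class ContainerSketch {
    private final JmxServerSketch jmxServer;

    ContainerSketch(JmxServerSketch jmxServer) {
        this.jmxServer = jmxServer;
    }

    String jmxSummary() {
        return "JMX url: " + jmxServer.getJmxUrl()
            + ", tunneling url: " + jmxServer.getTunnelingUrl();
    }

    public static void main(String[] args) {
        ContainerSketch container =
            new ContainerSketch(new JmxServerSketch("localhost", 9010, 9011));
        System.out.println(container.jmxSummary());
    }
}
```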


- Yan Fang


On July 1, 2015, 2:07 p.m., József Márton Jung wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/36089/
 ---
 
 (Updated July 1, 2015, 2:07 p.m.)
 
 
 Review request for samza.
 
 
 Repository: samza
 
 
 Description
 ---
 
 The JMX addresses of the application master and the containers are available
 through the AM UI
 
 
 Diffs
 ---
 
   checkstyle/import-control.xml 3374f0c
   samza-api/src/main/java/org/apache/samza/container/SamzaContainerContext.java fd7333b
   samza-core/src/main/java/org/apache/samza/container/LocalityManager.java e661e12
   samza-core/src/main/java/org/apache/samza/coordinator/stream/CoordinatorStreamMessage.java 6c1e488
   samza-core/src/main/java/org/apache/samza/job/model/ContainerModel.java 98a34bc
   samza-core/src/main/java/org/apache/samza/job/model/JobModel.java 95a2ce5
   samza-core/src/main/scala/org/apache/samza/container/SamzaContainer.scala cbacd18
   samza-core/src/main/scala/org/apache/samza/coordinator/JobCoordinator.scala 8ee034a
   samza-core/src/main/scala/org/apache/samza/metrics/JmxServer.scala f343faf
   samza-core/src/test/scala/org/apache/samza/container/TestSamzaContainer.scala 9fb1aa9
   samza-core/src/test/scala/org/apache/samza/container/TestTaskInstance.scala 7caad28
   samza-test/src/main/scala/org/apache/samza/test/performance/TestKeyValuePerformance.scala 1ce7d25
   samza-yarn/src/main/resources/scalate/WEB-INF/views/index.scaml cf0d2fc
   samza-yarn/src/main/scala/org/apache/samza/job/yarn/SamzaAppMaster.scala 20aa373
   samza-yarn/src/main/scala/org/apache/samza/job/yarn/SamzaAppMasterState.scala 1445605
 
 Diff: https://reviews.apache.org/r/36089/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 József Márton Jung
 




Re: Thoughts and obesrvations on Samza

2015-07-13 Thread Yi Pan
Hi, Jay,

Given all the user concerns and the board's disagreement on sub-projects, I
support your 5th option as well. As you said, even if the end goal is the
same, it might help pave a smoother path forward. One thing I learned over
the years is that what we planned for may not be the final product, and the
unexpected product may be even better if we learn and adapt along the way.
:)

So, since I assume that in option 5 Samza will fully embrace the new Kafka
Streams API as its core and depend heavily on it, I want to raise some
detailed logistical questions:
1. How does the Samza community contribute to the design and development of
the new Kafka Streams API? As Kartik mentioned, if there is a model for the
Samza community to contribute to just this part of the Kafka code base, it
would be a huge plus for the integration.
2. What's the scope of the new Kafka Streams API? Is it focused only on
message consumption, producing, Kafka-based partition distribution, offset
management, and message selection and delivery to StreamProcessor? In other
words, should samza-kv-store be in scope? The reasons I think it might be
better for it to stay in Samza initially are: a) the KV-store libraries do
not interact directly w/ Kafka brokers; they optionally use Kafka consumers
and producers like a client program; b) there are tons of experiments /
tune-ups on RocksDB for which we want faster iteration on this library (e.g.
there is an experimental time-sequence KV store implementation from LinkedIn
that we also want to try in the window operator in SQL). The down-side I can
see is that w/o this in the Kafka Streams API, the as-a-library mode may not
get state management support. If we can find a way to make sure that the
current Samza community can contribute to this library at a faster velocity,
I can be convinced otherwise as well. What's your opinion on this?

Overall, thanks a lot for pushing forward the whole discussion!

-Yi

On Mon, Jul 13, 2015 at 12:56 PM, Garry Turkington 
g.turking...@improvedigital.com wrote:

 Hi,

 I'm also supportive of Jay's option 5. There is a risk that the transformer
 API -- I'd have preferred Metamorphosis but it's too hard to type! --
 takes on a life of its own and we end up with two very different things, but
 given how good the Kafka community has been at introducing new producer and
 consumer clients and giving very clear guidance on when they are production
 ready, this is a danger I believe can be managed. It'd also be excellent to
 get some working code to beat around the notions of stream processing atop
 a system with transactional messages.

 On the question of whether to keep or deprecate SystemConsumer/Producer, I
 believe we need to get a better understanding over the next while of just what
 the Samza community is looking for in such connectivity. For my own use
 cases I have been looking to add additional implementations, primarily to
 use Samza as the data ingress and egress component around Kafka. Writing
 external clients that require their own reliability and scalability
 management gets old real fast, and pushing this into a simple Samza job that
 reads from system X and pushes into Kafka (or vice versa) was the obvious
 choice for me in the current model. For this type of usage, though, copycat
 is likely much superior (obviously that needs to be proven), and the question
 then is whether most Samza users look to the system implementations to also
 act as a front-end into Kafka, or whether significant usage is indeed intended
 to have the alternative systems as the primary message source. That
 understanding will, I think, give much clarity on just what value the
 abstraction overhead of the current model brings.

 Garry

 -Original Message-
 From: Yan Fang [mailto:yanfang...@gmail.com]
 Sent: 13 July 2015 19:58
 To: dev@samza.apache.org
 Subject: Re: Thoughts and obesrvations on Samza

 I am leaning toward Jay's fifth approach. It is not radical and gives us some
 time to see the outcome.

 In addition, I would suggest:

 1) Keep the SystemConsumer/SystemProducer API. The current
 SystemConsumer/SystemProducer API satisfies the usage (from Joardan's, and
 even Garry's, feedback) and is not so broken that we want to deprecate it.
 Though there are some issues in implementing Kinesis, they are not
 unfixable. Nothing should prevent Samza, as a stream processing system, from
 supporting other systems. In addition, there already are some systems
 existing besides Kafka: ElasticSearch (committed to the master), HDFS
 (patch-available), S3 (from the mailing list), Kinesis (developing in
 another repository), ActiveMQ (in two months). We may want to see how those
 go before we kill them.

 2) Have some Samza devs involved in Kafka's transformer client API.
 This can not only make future integration (if any) much easier, because
 they have knowledge about both systems, but also be good for Kafka's
 community, because Samza devs have the streaming process 

Review Request 36471: added stream for auto scaling, consumer to read from the stream in profiler, sliding window metric

2015-07-13 Thread Shadi A. Noghabi

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36471/
---

Review request for samza and Navina Ramesh.


Repository: samza


Description
---

Autoscaling for samza (work in progress)

This work is for SAMZA-719. Currently, a fixed number of containers is assigned 
to a job as an input configuration parameter. However, with this design jobs 
can fail due to insufficient resources (such as memory), or they can become a 
bottleneck in a workflow containing many jobs. While auto-scaling is a much 
broader term, the goal of this project is to enable a Samza job to 
automatically scale its containers so that job performance improves.

Based on the design, we need profiler, analyzer, optimizer and deployer 
modules.

- The profiler and analyzer have been added so far.
- Tests have not yet been added for those components, and further testing is needed.
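
To illustrate what a sliding window metric for the profiler might look like, here is a minimal sketch. It is not the SlidingWindowMetric class from this patch: the class name, the time-based window, and the choice of an average are all assumptions made for illustration. It keeps only the samples that fall inside the most recent window and reports their mean, which is the kind of smoothed signal an analyzer could compare against a memory threshold.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical time-based sliding window average; not the class from the review request.
final class SlidingWindowAverageSketch {
    private static final class Sample {
        final long timestampMs;
        final double value;

        Sample(long timestampMs, double value) {
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    private final long windowMs;
    private final Deque<Sample> samples = new ArrayDeque<>();
    private double sum = 0.0;

    SlidingWindowAverageSketch(long windowMs) {
        if (windowMs <= 0) {
            throw new IllegalArgumentException("windowMs must be positive");
        }
        this.windowMs = windowMs;
    }

    // Records a sample and drops everything at or before (timestampMs - windowMs).
    // Assumes timestamps arrive in non-decreasing order.
    void add(long timestampMs, double value) {
        samples.addLast(new Sample(timestampMs, value));
        sum += value;
        long cutoffMs = timestampMs - windowMs;
        while (!samples.isEmpty() && samples.peekFirst().timestampMs <= cutoffMs) {
            sum -= samples.removeFirst().value;
        }
    }

    // Number of samples currently inside the window.
    int size() {
        return samples.size();
    }

    // Mean of the samples inside the window, or 0.0 when the window is empty.
    double average() {
        return samples.isEmpty() ? 0.0 : sum / samples.size();
    }
}
```

A running sum keeps add() and average() cheap; a long-running job would probably want to recompute the sum occasionally to limit floating-point drift.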


Diffs
-

  checkstyle/import-control.xml 3374f0c432e61ac4cda275377604cfd481f0cddf
  samza-api/src/main/java/org/apache/samza/autoScaling/AutoScalingMode.java PRE-CREATION
  samza-api/src/main/java/org/apache/samza/autoScaling/Profiler.java PRE-CREATION
  samza-api/src/main/java/org/apache/samza/autoScaling/analyzer/Analyzer.java PRE-CREATION
  samza-core/src/main/java/org/apache/samza/autoScaling/AutoScalingSystem.java PRE-CREATION
  samza-core/src/main/java/org/apache/samza/autoScaling/SnapshotReporterProfiler.java PRE-CREATION
  samza-core/src/main/java/org/apache/samza/autoScaling/analyzer/MemoryAnalyzer.java PRE-CREATION
  samza-core/src/main/java/org/apache/samza/autoScaling/metrics/MemoryMetrics.java PRE-CREATION
  samza-core/src/main/java/org/apache/samza/autoScaling/metrics/SlidingWindowMetric.java PRE-CREATION
  samza-core/src/main/java/org/apache/samza/autoScaling/stream/AutoScalingMetricsSystemConsumer.java PRE-CREATION
  samza-core/src/main/scala/org/apache/samza/autoScaling/stream/AutoScalingMetricsSystemFactory.scala PRE-CREATION
  samza-core/src/main/scala/org/apache/samza/config/AutoScalingConfigRewriter.scala PRE-CREATION
  samza-core/src/main/scala/org/apache/samza/config/JobConfig.scala e4b14f4da6649eb78753ba3b3f529373b6f2dbe4
  samza-core/src/main/scala/org/apache/samza/metrics/JvmMetrics.scala a95a0ecde300f6576fe46b37d5898e3d21634126
  samza-core/src/main/scala/org/apache/samza/util/Util.scala 2feb65b729b45fbc3b83a75c4072527e3c4e60be
  samza-core/src/test/java/org/apache/samza/autoScaling/SlidingWindowMetricTest.java PRE-CREATION

Diff: https://reviews.apache.org/r/36471/diff/


Testing
---


Thanks,

Shadi A. Noghabi



Re: Question about sub-projects and project merging

2015-07-13 Thread Niclas Hedhman
From the peanut gallery:

  a. It looks to me that there is no overwhelming reason to merge the
communities. In fact, IF it already was a single community, it might be
time to split Samza out. Ask this question: if the active Samza devs laid
down their tools, how many Kafka devs would care about (and further the dev
of) Samza?

  b. Having a hard dependency on another upstream project is commonplace
in the ASF. Take a look at the Hadoop ecosystem for many examples.

  c. To me, it sounds more like a technical issue of design, where Samza is
more flexible than needed, perhaps because the original intent was to allow
integration with more messaging systems than Kafka. Redesigning seems to be
a driver, and that doesn't need to lead to merging the communities.

  d. Is there actually some other underlying community issue? I haven't seen any
worrying signs from Board reports, but I am asking anyway... These kinds of
questions often surface when the most active members of the community feel
somewhat burned out and are looking for other active devs to help out.


Cheers
Niclas

On Mon, Jul 13, 2015 at 8:37 AM, Jay Kreps jay.kr...@gmail.com wrote:

 Hey board members,

 There is a longish thread on the Apache Samza mailing list on the
 relationship between Kafka and Samza and whether they wouldn't make a lot
 more sense as a single project. This raised some questions I was hoping to
 get advice on.

 Discussion thread (warning: super long, I attempt to summarize relevant
 bits below):

 http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3ccabyby7d_-jcxj7fizsjuebjedgbep33flyx3nrozt0yeox9...@mail.gmail.com%3E

 Anyhow, some people thought Apache has lots of sub-projects, that would
 be a graceful way to step in the right direction. At that point others
 popped up and said, sub-projects are discouraged by the board.

 I'm not sure if we understand technically what a subproject is, but I
 think it means a second repo/committership under the same PMC.

 A few questions:
 - Is that what a sub-project is?
 - Are they discouraged? If so, why?
 - Assuming it makes sense in this case what is the process for making one?
 - Putting aside sub-projects as a mechanism what are examples where
 communities merged successfully? We were pointed towards Lucene/SOLR. Are
 there others?

 Relevant background info:
 - Samza depends on Kafka, but not vice versa
 - There is some overlap in committers but not extensive (3/11 Samza
 committers are also Kafka committers)

 Thanks for the advice!

 -Jay






-- 
Niclas Hedhman, Software Developer
http://zest.apache.org - New Energy for Java


Re: Question about sub-projects and project merging

2015-07-13 Thread Hervé Boutemy
Some remarks on what a sub-project is, taken from my experience working on 
this exact topic for https://projects.apache.org/

First: see the facts at https://projects.apache.org/projects.html?pmc for a 
complete list of projects grouped by PMC (as documented by the PMCs; there is 
also a lot of software that is not described).

I came to the conclusion that this is a question of semantics around the term 
"project", with 2 competing visions at the ASF:
- either you talk of TLPs + sub-projects
- or you talk about committees + projects

After trying both visions for https://projects.apache.org/ , which started on 
the TLP + sub-projects vision because "TLP" is pretty much used by all of us, 
I finally preferred committees + projects, since it avoids the question of 
classifying projects into Top Level Projects and sub-projects, with the bad 
impression that puts on the sub-ones, and the fact that in some committees 
there is no project that is more top or sub: see Commons or Logging.
But for some committees, there really is a main project, and other projects are 
more like extensions or plugins: see Ant or Velocity.

IMHO, talking about committees and projects is the best way to avoid the 
heated feelings that come from the TLPs + sub-projects vision.

With those terms, your question of merging 2 TLPs becomes merging 2 
committees, i.e. their communities, and putting 2 projects under the management 
of this merged committee. IMHO, the description is more verbose but the 
debate is less heated and focused on the main question: is this really 
the same community, which should then be managed by one committee only?


I don't have any opinion on the Kafka and Samza case: I just hope these 
explanations will help the discussion.

Regards,

Hervé

On Sunday 12 July 2015 22:37:55, Jay Kreps wrote:
 Hey board members,
 
 There is a longish thread on the Apache Samza mailing list on the
 relationship between Kafka and Samza and whether they wouldn't make a lot
 more sense as a single project. This raised some questions I was hoping to
 get advice on.
 
 Discussion thread (warning: super long, I attempt to summarize relevant
 bits below):
 http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCABYbY7d_-jcxj7fizsjuebjedgbep33flyx3nrozt0yeox9...@mail.gmail.com%3E
 
 Anyhow, some people thought Apache has lots of sub-projects, that would
 be a graceful way to step in the right direction. At that point others
 popped up and said, sub-projects are discouraged by the board.
 
 I'm not sure if we understand technically what a subproject is, but I think
 it means a second repo/committership under the same PMC.
 
 A few questions:
 - Is that what a sub-project is?
 - Are they discouraged? If so, why?
 - Assuming it makes sense in this case what is the process for making one?
 - Putting aside sub-projects as a mechanism what are examples where
 communities merged successfully? We were pointed towards Lucene/SOLR. Are
 there others?
 
 Relevant background info:
 - Samza depends on Kafka, but not vice versa
 - There is some overlap in committers but not extensive (3/11 Samza
 committers are also Kafka committers)
 
 Thanks for the advice!
 
 -Jay