Re: Thoughts and obesrvations on Samza

2015-07-10 Thread Jay Kreps
Yeah totally agree. I think you have this issue even today, right? I.e. if
you need to make a simple config change and you're running in YARN today
you end up bouncing the job which then rebuilds state. I think the fix is
exactly what you described which is to have a long timeout on partition
movement for stateful jobs so that if a job is just getting bounced, and
the cluster manager (or admin) is smart enough to restart it on the same
host when possible, it can optimistically reuse any existing state it finds
on disk (if it is valid).

So in this model the charter of the CM is to place processes as stickily as
possible and to restart or re-place failed processes. The charter of the
partition management system is to control the assignment of work to these
processes. The nice thing about this is that the work assignment, timeouts,
behavior, configs, and code will all be the same across all cluster
managers.

So I think that prototype would actually give you exactly what you want
today for any cluster manager (or manual placement + restart script) that
was sticky in terms of host placement since there is already a configurable
partition movement timeout and task-by-task state reuse with a check on
state validity.

-Jay

On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover roger.hoo...@gmail.com
wrote:

 That would be great to let Kafka do as much heavy lifting as possible and
 make it easier for other languages to implement Samza apis.

 One thing to watch out for is the interplay between Kafka's group
 management and the external scheduler/process manager's fault tolerance.
 If a container dies, the Kafka group membership protocol will try to assign
 it's tasks to other containers while at the same time the process manager
 is trying to relaunch the container.  Without some consideration for this
 (like a configurable amount of time to wait before Kafka alters the group
 membership), there may be thrashing going on which is especially bad for
 containers with large amounts of local state.

 Someone else pointed this out already but I thought it might be worth
 calling out again.

 Cheers,

 Roger


 On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps j...@confluent.io wrote:

  Hey Roger,
 
  I couldn't agree more. We spent a bunch of time talking to people and
 that
  is exactly the stuff we heard time and again. What makes it hard, of
  course, is that there is some tension between compatibility with what's
  there now and making things better for new users.
 
  I also strongly agree with the importance of multi-language support. We
 are
  talking now about Java, but for application development use cases people
  want to work in whatever language they are using elsewhere. I think
 moving
  to a model where Kafka itself does the group membership, lifecycle
 control,
  and partition assignment has the advantage of putting all that complex
  stuff behind a clean api that the clients are already going to be
  implementing for their consumer, so the added functionality for stream
  processing beyond a consumer becomes very minor.
 
  -Jay
 
  On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover roger.hoo...@gmail.com
  wrote:
 
   Metamorphosis...nice. :)
  
   This has been a great discussion.  As a user of Samza who's recently
   integrated it into a relatively large organization, I just want to add
   support to a few points already made.
  
   The biggest hurdles to adoption of Samza as it currently exists that
 I've
   experienced are:
   1) YARN - YARN is overly complex in many environments where Puppet
 would
  do
   just fine but it was the only mechanism to get fault tolerance.
   2) Configuration - I think I like the idea of configuring most of the
 job
   in code rather than config files.  In general, I think the goal should
 be
   to make it harder to make mistakes, especially of the kind where the
 code
   expects something and the config doesn't match.  The current config is
   quite intricate and error-prone.  For example, the application logic
 may
   depend on bootstrapping a topic but rather than asserting that in the
  code,
   you have to rely on getting the config right.  Likewise with serdes,
 the
   Java representations produced by various serdes (JSON, Avro, etc.) are
  not
   equivalent so you cannot just reconfigure a serde without changing the
   code.   It would be nice for jobs to be able to assert what they expect
   from their input topics in terms of partitioning.  This is getting a
  little
   off topic but I was even thinking about creating a Samza config
 linter
   that would sanity check a set of configs.  Especially in organizations
   where config is managed by a different team than the application
  developer,
   it's very hard to get avoid config mistakes.
   3) Java/Scala centric - for many teams (especially DevOps-type folks),
  the
   pain of the Java toolchain (maven, slow builds, weak command line
  support,
   configuration over convention) really inhibits productivity.  As more
 and
   

Re: Thoughts and obesrvations on Samza

2015-07-10 Thread Roger Hoover
That would be great to let Kafka do as much heavy lifting as possible and
make it easier for other languages to implement Samza apis.

One thing to watch out for is the interplay between Kafka's group
management and the external scheduler/process manager's fault tolerance.
If a container dies, the Kafka group membership protocol will try to assign
it's tasks to other containers while at the same time the process manager
is trying to relaunch the container.  Without some consideration for this
(like a configurable amount of time to wait before Kafka alters the group
membership), there may be thrashing going on which is especially bad for
containers with large amounts of local state.

Someone else pointed this out already but I thought it might be worth
calling out again.

Cheers,

Roger


On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps j...@confluent.io wrote:

 Hey Roger,

 I couldn't agree more. We spent a bunch of time talking to people and that
 is exactly the stuff we heard time and again. What makes it hard, of
 course, is that there is some tension between compatibility with what's
 there now and making things better for new users.

 I also strongly agree with the importance of multi-language support. We are
 talking now about Java, but for application development use cases people
 want to work in whatever language they are using elsewhere. I think moving
 to a model where Kafka itself does the group membership, lifecycle control,
 and partition assignment has the advantage of putting all that complex
 stuff behind a clean api that the clients are already going to be
 implementing for their consumer, so the added functionality for stream
 processing beyond a consumer becomes very minor.

 -Jay

 On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover roger.hoo...@gmail.com
 wrote:

  Metamorphosis...nice. :)
 
  This has been a great discussion.  As a user of Samza who's recently
  integrated it into a relatively large organization, I just want to add
  support to a few points already made.
 
  The biggest hurdles to adoption of Samza as it currently exists that I've
  experienced are:
  1) YARN - YARN is overly complex in many environments where Puppet would
 do
  just fine but it was the only mechanism to get fault tolerance.
  2) Configuration - I think I like the idea of configuring most of the job
  in code rather than config files.  In general, I think the goal should be
  to make it harder to make mistakes, especially of the kind where the code
  expects something and the config doesn't match.  The current config is
  quite intricate and error-prone.  For example, the application logic may
  depend on bootstrapping a topic but rather than asserting that in the
 code,
  you have to rely on getting the config right.  Likewise with serdes, the
  Java representations produced by various serdes (JSON, Avro, etc.) are
 not
  equivalent so you cannot just reconfigure a serde without changing the
  code.   It would be nice for jobs to be able to assert what they expect
  from their input topics in terms of partitioning.  This is getting a
 little
  off topic but I was even thinking about creating a Samza config linter
  that would sanity check a set of configs.  Especially in organizations
  where config is managed by a different team than the application
 developer,
  it's very hard to get avoid config mistakes.
  3) Java/Scala centric - for many teams (especially DevOps-type folks),
 the
  pain of the Java toolchain (maven, slow builds, weak command line
 support,
  configuration over convention) really inhibits productivity.  As more and
  more high-quality clients become available for Kafka, I hope they'll
 follow
  Samza's model.  Not sure how much it affects the proposals in this thread
  but please consider other languages in the ecosystem as well.  From what
  I've heard, Spark has more Python users than Java/Scala.
  (FYI, we added a Jython wrapper for the Samza API
 
 
 https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza
  and are working on a Yeoman generator
  https://github.com/Quantiply/generator-rico for Jython/Samza projects to
  alleviate some of the pain)
 
  I also want to underscore Jay's point about improving the user
 experience.
  That's a very important factor for adoption.  I think the goal should be
 to
  make Samza as easy to get started with as something like Logstash.
  Logstash is vastly inferior in terms of capabilities to Samza but it's
 easy
  to get started and that makes a big difference.
 
  Cheers,
 
  Roger
 
 
 
 
 
  On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De Francisci Morales 
  g...@apache.org wrote:
 
   Forgot to add. On the naming issues, Kafka Metamorphosis is a clear
  winner
   :)
  
   --
   Gianmarco
  
   On 7 July 2015 at 13:26, Gianmarco De Francisci Morales 
 g...@apache.org
  
   wrote:
  
Hi,
   
@Martin, thanks for you comments.
Maybe I'm missing some important point, but I think coupling the
  releases
is 

Re: Samza job on YARN stuck Unassigned

2015-07-10 Thread Gustavo Anatoly
Hi Krzysztof,

I had connectivity errors, but in my case was the /etc/hosts misconfigured.

Cheers.

2015-07-10 12:11 GMT-03:00 Roger Hoover roger.hoo...@gmail.com:

 Hi Krzysztof,

 I haven't seen that error before.  It does sound like it could be a
 connection issue.  Did you check that the YARN node has access
 to hdfs:///user/samza/deploy/event-log-etl-nested-0.1.0-dist.tar.gz?

 One way to set the AM and containers to debug is to include a log4j.xml
 file in your tar.gz on the lib folder.  There special logic in the start
 scripts (

 https://github.com/apache/samza/blob/master/samza-shell/src/main/bash/run-class.sh#L40
 )
 that checks for that path and doesn't work with log4j.properties, for
 example.

 Cheers,

 Roger



 On Fri, Jul 10, 2015 at 4:18 AM, Krzysztof Zarzycki k.zarzy...@gmail.com
 wrote:

  Hi there Samza developers,
 
  I have a problem that I cannot overcome with deploying Samza task on
 YARN.
  When I submitted the task, ApplicationMasters get created (2 of them),
 job
  is visible, but in state UNASSIGNED. After some time the job FAILED.
 
  application information on resource manager panel is :
  State: FAILED
  FinalStatus: FAILED
  Elapsed: 25mins, 2sec
  Diagnostics: Application application_1424354741837_0380 failed 2 times
 due
  to ApplicationMaster for attempt appattempt_1424354741837_0380_02
 timed
  out. Failing the application.
 
 
  When I look into the logs of ApplicationMaster I see no errors, no
  warnings, anything wrong: Please see the output of yarn logs comand
  attached.
 
  My guess would be that connection failed between some components
  (container to ApplicationMaster? NodeManager? ).  I suspect that when
  looking at jstack output in the AM:
 
  main #1 prio=5 os_prio=0 tid=0x7f9338015000 nid=0x6f2f waiting on
  condition [0x7f933de6e000]
 java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at
 
 org.apache.hadoop.util.ThreadUtil.sleepAtLeastIgnoreInterrupts(ThreadUtil.java:43)
at
 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:154)
at com.sun.proxy.$Proxy18.registerApplicationMaster(Unknown Source)
at
 
 org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.registerApplicationMaster(AMRMClientImpl.java:196)
at
 
 org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl.registerApplicationMaster(AMRMClientAsyncImpl.java:138)
at
 
 org.apache.samza.job.yarn.SamzaAppMasterLifecycle.onInit(SamzaAppMasterLifecycle.scala:39)
at
 
 org.apache.samza.job.yarn.SamzaAppMaster$$anonfun$run$1.apply(SamzaAppMaster.scala:108)
at
 
 org.apache.samza.job.yarn.SamzaAppMaster$$anonfun$run$1.apply(SamzaAppMaster.scala:108)
at scala.collection.immutable.List.foreach(List.scala:318)
at
  org.apache.samza.job.yarn.SamzaAppMaster$.run(SamzaAppMaster.scala:108)
at
  org.apache.samza.job.yarn.SamzaAppMaster$.main(SamzaAppMaster.scala:95)
at org.apache.samza.job.yarn.SamzaAppMaster.main(SamzaAppMaster.scala)
 
 
  On the other hand I see in logs correct RM addresses:
  15/07/10 12:17:30 INFO client.RMProxy: Connecting to ResourceManager at
  hdnn02.company.com/148.251.82.11:8030
  15/07/10 12:17:31 INFO client.RMProxy: Connecting to ResourceManager at
  hdnn02.company.com/148.251.82.11:8050
  ...
  2015-07-10 12:17:31,032 [main] INFO  o.apache.samza.job.yarn.ClientHelper
  - trying to connect to RM hdnn02.company.com:8050
  ...
  2015-07-10 12:17:31,680 [main] INFO  o.a.s.job.yarn.SamzaAppMasterService
  - Webapp is started at (rpc http://78.46.56.88:43268/, tracking http://
 
 
  Does anyone knows what could be wrong here? I'll be grateful for any
 help,
  also in just debugging the case.
  I start with a simple question: do you know how to set log4j for AM 
  containers to DEBUG?
 
  Thank you!
  Krzysztof
 
 
 



Review Request 36398: SAMZA-714: update release doc in 0.9.1

2015-07-10 Thread Yi Pan (Data Infrastructure)

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36398/
---

Review request for samza, Yan Fang, Chinmay Soman, Chris Riccomini, and Navina 
Ramesh.


Bugs: SAMZA-714
https://issues.apache.org/jira/browse/SAMZA-714


Repository: samza


Description
---

SAMZA-714: update release doc in 0.9.1


Diffs
-

  docs/startup/hello-samza/versioned/index.md 
507ceb51e239fb873ee210d909d427ffd21955e3 

Diff: https://reviews.apache.org/r/36398/diff/


Testing
---

Verified w/ RB #36384 on samza-hello-samza


Thanks,

Yi Pan (Data Infrastructure)



Re: Thoughts and obesrvations on Samza

2015-07-10 Thread Jakob Homan
  This leads me to thinking that merging projects and communities might be a 
 good idea: with the union of experience from both communities, we will 
 probably build a better system that is better for users.
Is this what's being proposed though? Merging the projects seems like
a consequence of at most one of the three directions under discussion:
1) Samza 2.0: The Samza community relies more heavily on Kafka for
configuration, etc. (to a greater or lesser extent to be determined)
but the Samza community would not automatically merge withe Kafka
community (the Phoenix/HBase example is a good one here).
2) Samza Reboot: The Samza community continues to exist with a limited
project scope, but similarly would not need to be part of the Kafka
community (ie given committership) to progress.  Here, maybe the Samza
team would become a subproject of Kafka (the Board frowns on
subprojects at the moment, so I'm not sure if that's even feasible),
but that would not be required.
3) Hey Samza! FYI, Kafka does streaming now: In this option the Kafka
team builds its own streaming library, possibly off of Jay's
prototype, which has not direct lineage to the Samza team.  There's no
reason for the Kafka team to bring in the Samza team.

Is the Kafka community on board with this?

To be clear, all three options under discussion are interesting,
technically valid and likely healthy directions for the project.
Also, they are not mutually exclusive.  The Samza community could
decide to pursue, say, 'Samza 2.0', while the Kafka community went
forward with 'Hey Samza!'  My points above are directed entirely at
the community aspect of these choices.
-Jakob

On 10 July 2015 at 09:10, Roger Hoover roger.hoo...@gmail.com wrote:
 That's great.  Thanks, Jay.

 On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps j...@confluent.io wrote:

 Yeah totally agree. I think you have this issue even today, right? I.e. if
 you need to make a simple config change and you're running in YARN today
 you end up bouncing the job which then rebuilds state. I think the fix is
 exactly what you described which is to have a long timeout on partition
 movement for stateful jobs so that if a job is just getting bounced, and
 the cluster manager (or admin) is smart enough to restart it on the same
 host when possible, it can optimistically reuse any existing state it finds
 on disk (if it is valid).

 So in this model the charter of the CM is to place processes as stickily as
 possible and to restart or re-place failed processes. The charter of the
 partition management system is to control the assignment of work to these
 processes. The nice thing about this is that the work assignment, timeouts,
 behavior, configs, and code will all be the same across all cluster
 managers.

 So I think that prototype would actually give you exactly what you want
 today for any cluster manager (or manual placement + restart script) that
 was sticky in terms of host placement since there is already a configurable
 partition movement timeout and task-by-task state reuse with a check on
 state validity.

 -Jay

 On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover roger.hoo...@gmail.com
 wrote:

  That would be great to let Kafka do as much heavy lifting as possible and
  make it easier for other languages to implement Samza apis.
 
  One thing to watch out for is the interplay between Kafka's group
  management and the external scheduler/process manager's fault tolerance.
  If a container dies, the Kafka group membership protocol will try to
 assign
  it's tasks to other containers while at the same time the process manager
  is trying to relaunch the container.  Without some consideration for this
  (like a configurable amount of time to wait before Kafka alters the group
  membership), there may be thrashing going on which is especially bad for
  containers with large amounts of local state.
 
  Someone else pointed this out already but I thought it might be worth
  calling out again.
 
  Cheers,
 
  Roger
 
 
  On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps j...@confluent.io wrote:
 
   Hey Roger,
  
   I couldn't agree more. We spent a bunch of time talking to people and
  that
   is exactly the stuff we heard time and again. What makes it hard, of
   course, is that there is some tension between compatibility with what's
   there now and making things better for new users.
  
   I also strongly agree with the importance of multi-language support. We
  are
   talking now about Java, but for application development use cases
 people
   want to work in whatever language they are using elsewhere. I think
  moving
   to a model where Kafka itself does the group membership, lifecycle
  control,
   and partition assignment has the advantage of putting all that complex
   stuff behind a clean api that the clients are already going to be
   implementing for their consumer, so the added functionality for stream
   processing beyond a consumer becomes very minor.
  
   -Jay
  
   On Tue, Jul 7, 2015 

Re: Review Request 35397: Fix SAMZA-697

2015-07-10 Thread Guozhang Wang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35397/
---

(Updated July 10, 2015, 4:54 p.m.)


Review request for samza.


Summary (updated)
-

Fix SAMZA-697


Bugs: SAMZA-697
https://issues.apache.org/jira/browse/SAMZA-697


Repository: samza


Description (updated)
---

Add missing files


Diffs (updated)
-

  checkstyle/import-control.xml 3374f0c432e61ac4cda275377604cfd481f0cddf 
  docs/learn/documentation/versioned/jobs/configuration-table.html 
405e2cea4fd1d037cc26b3537f6bb406eded202b 
  samza-core/src/main/java/org/apache/samza/task/TaskClassLoader.java 
PRE-CREATION 
  samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala 
0b3a235b5ab1d6bd60669bfe6023f6b0b4e943d3 
  samza-core/src/main/scala/org/apache/samza/container/SamzaContainer.scala 
cbacd183420e9d1d72b05693b55a8f0a62d59fc5 
  samza-core/src/main/scala/org/apache/samza/container/TaskInstance.scala 
c5a5ea5dea9a950fc741625238f5bf8b1f362180 
  samza-core/src/main/scala/org/apache/samza/job/JobRunner.scala 
1c178a661e449c6bdfc4ce431aef9bb2d261a6c2 
  samza-core/src/main/scala/org/apache/samza/job/local/ProcessJobFactory.scala 
4fac154709d72ab594485dad93c912b55fb1617e 
  samza-core/src/test/java/org/apache/samza/task/TestTaskClassLoader.java 
PRE-CREATION 
  samza-core/src/test/scala/org/apache/samza/container/TestSamzaContainer.scala 
9fb1aa98fcd14397e8a4cb00c67537482e95fa53 
  samza-core/src/test/scala/org/apache/samza/container/TestTaskInstance.scala 
7caad28c9298485753ab861da76793cf925953ed 

Diff: https://reviews.apache.org/r/35397/diff/


Testing
---

unit tests


Thanks,

Guozhang Wang



Re: Thoughts and obesrvations on Samza

2015-07-10 Thread Roger Hoover
That's great.  Thanks, Jay.

On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps j...@confluent.io wrote:

 Yeah totally agree. I think you have this issue even today, right? I.e. if
 you need to make a simple config change and you're running in YARN today
 you end up bouncing the job which then rebuilds state. I think the fix is
 exactly what you described which is to have a long timeout on partition
 movement for stateful jobs so that if a job is just getting bounced, and
 the cluster manager (or admin) is smart enough to restart it on the same
 host when possible, it can optimistically reuse any existing state it finds
 on disk (if it is valid).

 So in this model the charter of the CM is to place processes as stickily as
 possible and to restart or re-place failed processes. The charter of the
 partition management system is to control the assignment of work to these
 processes. The nice thing about this is that the work assignment, timeouts,
 behavior, configs, and code will all be the same across all cluster
 managers.

 So I think that prototype would actually give you exactly what you want
 today for any cluster manager (or manual placement + restart script) that
 was sticky in terms of host placement since there is already a configurable
 partition movement timeout and task-by-task state reuse with a check on
 state validity.

 -Jay

 On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover roger.hoo...@gmail.com
 wrote:

  That would be great to let Kafka do as much heavy lifting as possible and
  make it easier for other languages to implement Samza apis.
 
  One thing to watch out for is the interplay between Kafka's group
  management and the external scheduler/process manager's fault tolerance.
  If a container dies, the Kafka group membership protocol will try to
 assign
  it's tasks to other containers while at the same time the process manager
  is trying to relaunch the container.  Without some consideration for this
  (like a configurable amount of time to wait before Kafka alters the group
  membership), there may be thrashing going on which is especially bad for
  containers with large amounts of local state.
 
  Someone else pointed this out already but I thought it might be worth
  calling out again.
 
  Cheers,
 
  Roger
 
 
  On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps j...@confluent.io wrote:
 
   Hey Roger,
  
   I couldn't agree more. We spent a bunch of time talking to people and
  that
   is exactly the stuff we heard time and again. What makes it hard, of
   course, is that there is some tension between compatibility with what's
   there now and making things better for new users.
  
   I also strongly agree with the importance of multi-language support. We
  are
   talking now about Java, but for application development use cases
 people
   want to work in whatever language they are using elsewhere. I think
  moving
   to a model where Kafka itself does the group membership, lifecycle
  control,
   and partition assignment has the advantage of putting all that complex
   stuff behind a clean api that the clients are already going to be
   implementing for their consumer, so the added functionality for stream
   processing beyond a consumer becomes very minor.
  
   -Jay
  
   On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover roger.hoo...@gmail.com
   wrote:
  
Metamorphosis...nice. :)
   
This has been a great discussion.  As a user of Samza who's recently
integrated it into a relatively large organization, I just want to
 add
support to a few points already made.
   
The biggest hurdles to adoption of Samza as it currently exists that
  I've
experienced are:
1) YARN - YARN is overly complex in many environments where Puppet
  would
   do
just fine but it was the only mechanism to get fault tolerance.
2) Configuration - I think I like the idea of configuring most of the
  job
in code rather than config files.  In general, I think the goal
 should
  be
to make it harder to make mistakes, especially of the kind where the
  code
expects something and the config doesn't match.  The current config
 is
quite intricate and error-prone.  For example, the application logic
  may
depend on bootstrapping a topic but rather than asserting that in the
   code,
you have to rely on getting the config right.  Likewise with serdes,
  the
Java representations produced by various serdes (JSON, Avro, etc.)
 are
   not
equivalent so you cannot just reconfigure a serde without changing
 the
code.   It would be nice for jobs to be able to assert what they
 expect
from their input topics in terms of partitioning.  This is getting a
   little
off topic but I was even thinking about creating a Samza config
  linter
that would sanity check a set of configs.  Especially in
 organizations
where config is managed by a different team than the application
   developer,
it's very hard to get avoid config mistakes.
3) Java/Scala centric 

Re: Thoughts and obesrvations on Samza

2015-07-10 Thread Jay Kreps
Yeah I agree with this summary. I think there are kind of two questions
here:
1. Technically does alignment/reliance on Kafka make sense
2. Branding wise (naming, website, concepts, etc) does alignment with Kafka
make sense

Personally I do think both of these things would be really valuable, and
would dramatically alter the trajectory of the project.

My preference would be to see if people can mostly agree on a direction
rather than splintering things off. From my point of view the ideal outcome
of all the options discussed would be to make Samza a closely aligned
subproject, maintained in a separate repository and retaining the existing
committership but sharing as much else as possible (website, etc). No idea
about how these things work, Jacob, you probably know more.

No discussion amongst the Kafka folks has happened on this, but likely we
should figure out what the Samza community actually wants first.

I admit that this is a fairly radical departure from how things are.

If that doesn't fly, I think, yeah we could leave Samza as it is and do the
more radical reboot inside Kafka. From my point of view that does leave
things in a somewhat confusing state since now there are two stream
processing systems more or less coupled to Kafka in large part made by the
same people. But, arguably that might be a cleaner way to make the cut-over
and perhaps less risky for Samza community since if it works people can
switch and if it doesn't nothing will have changed. Dunno, how do people
feel about this?

-Jay

On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan jgho...@gmail.com wrote:

   This leads me to thinking that merging projects and communities might
 be a good idea: with the union of experience from both communities, we will
 probably build a better system that is better for users.
 Is this what's being proposed though? Merging the projects seems like
 a consequence of at most one of the three directions under discussion:
 1) Samza 2.0: The Samza community relies more heavily on Kafka for
 configuration, etc. (to a greater or lesser extent to be determined)
 but the Samza community would not automatically merge withe Kafka
 community (the Phoenix/HBase example is a good one here).
 2) Samza Reboot: The Samza community continues to exist with a limited
 project scope, but similarly would not need to be part of the Kafka
 community (ie given committership) to progress.  Here, maybe the Samza
 team would become a subproject of Kafka (the Board frowns on
 subprojects at the moment, so I'm not sure if that's even feasible),
 but that would not be required.
 3) Hey Samza! FYI, Kafka does streaming now: In this option the Kafka
 team builds its own streaming library, possibly off of Jay's
 prototype, which has not direct lineage to the Samza team.  There's no
 reason for the Kafka team to bring in the Samza team.

 Is the Kafka community on board with this?

 To be clear, all three options under discussion are interesting,
 technically valid and likely healthy directions for the project.
 Also, they are not mutually exclusive.  The Samza community could
 decide to pursue, say, 'Samza 2.0', while the Kafka community went
 forward with 'Hey Samza!'  My points above are directed entirely at
 the community aspect of these choices.
 -Jakob

 On 10 July 2015 at 09:10, Roger Hoover roger.hoo...@gmail.com wrote:
  That's great.  Thanks, Jay.
 
  On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps j...@confluent.io wrote:
 
  Yeah totally agree. I think you have this issue even today, right? I.e.
 if
  you need to make a simple config change and you're running in YARN today
  you end up bouncing the job which then rebuilds state. I think the fix
 is
  exactly what you described which is to have a long timeout on partition
  movement for stateful jobs so that if a job is just getting bounced, and
  the cluster manager (or admin) is smart enough to restart it on the same
  host when possible, it can optimistically reuse any existing state it
 finds
  on disk (if it is valid).
 
  So in this model the charter of the CM is to place processes as
 stickily as
  possible and to restart or re-place failed processes. The charter of the
  partition management system is to control the assignment of work to
 these
  processes. The nice thing about this is that the work assignment,
 timeouts,
  behavior, configs, and code will all be the same across all cluster
  managers.
 
  So I think that prototype would actually give you exactly what you want
  today for any cluster manager (or manual placement + restart script)
 that
  was sticky in terms of host placement since there is already a
 configurable
  partition movement timeout and task-by-task state reuse with a check on
  state validity.
 
  -Jay
 
  On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover roger.hoo...@gmail.com
  wrote:
 
   That would be great to let Kafka do as much heavy lifting as possible
 and
   make it easier for other languages to implement Samza apis.
  
   One thing to watch 

Re: Thoughts and obesrvations on Samza

2015-07-10 Thread Yan Fang
Thanks, Jay. This argument persuaded me actually. :)

Fang, Yan
yanfang...@gmail.com

On Fri, Jul 10, 2015 at 2:33 PM, Jay Kreps j...@confluent.io wrote:

 Hey Yan,

 Yeah philosophically I think the argument is that you should capture the
 stream in Kafka independent of the transformation. This is obviously a
 Kafka-centric view point.

 Advantages of this:
 - In practice I think this is what e.g. Storm people often end up doing
 anyway. You usually need to throttle any access to a live serving database.
 - Can have multiple subscribers and they get the same thing without
 additional load on the source system.
 - Applications can tap into the stream if need be by subscribing.
 - You can debug your transformation by tailing the Kafka topic with the
 console consumer
 - Can tee off the same data stream for batch analysis or Lambda arch style
 re-processing

 The disadvantage is that it will use Kafka resources. But the idea is
 eventually you will have multiple subscribers to any data source (at least
 for monitoring) so you will end up there soon enough anyway.

 Down the road the technical benefit is that I think it gives us a good path
 towards end-to-end exactly once semantics from source to destination.
 Basically the connectors need to support idempotence when talking to Kafka
 and we need the transactional write feature in Kafka to make the
 transformation atomic. This is actually pretty doable if you separate
 connector=kafka problem from the generic transformations which are always
 kafka=kafka. However I think it is quite impossible to do in a all_things
 = all_things environment. Today you can say well the semantics of the
 Samza APIs depend on the connectors you use but it is actually worse then
 that because the semantics actually depend on the pairing of connectors--so
 not only can you probably not get a usable exactly once guarantee
 end-to-end it can actually be quite hard to reverse engineer what property
 (if any) your end-to-end flow has if you have heterogenous systems.

 -Jay

 On Fri, Jul 10, 2015 at 2:00 PM, Yan Fang yanfang...@gmail.com wrote:

  {quote}
  maintained in a separate repository and retaining the existing
  committership but sharing as much else as possible (website, etc)
  {quote}
 
  Overall, I agree on this idea. Now the question is more about how to do
  it.
 
  On the other hand, one thing I want to point out is that, if we decide to
  go this way, how do we want to support
  otherSystem-transformation-otherSystem use case?
 
  Basically, there are four user groups here:
 
  1. Kafka-transformation-Kafka
  2. Kafka-transformation-otherSystem
  3. otherSystem-transformation-Kafka
  4. otherSystem-transformation-otherSystem
 
  For group 1, they can easily use the new Samza library to achieve. For
  group 2 and 3, they can use copyCat - transformation - Kafka or Kafka-
  transformation - copyCat.
 
  The problem is for group 4. Do we want to abandon this or still support
 it?
  Of course, this use case can be achieved by using copyCat -
 transformation
  - Kafka - transformation - copyCat, the thing is how we persuade them
 to
  do this long chain. If yes, it will also be a win for Kafka too. Or if
  there is no one in this community actually doing this so far, maybe ok to
  not support the group 4 directly.
 
  Thanks,
 
  Fang, Yan
  yanfang...@gmail.com
 
  On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps j...@confluent.io wrote:
 
   Yeah I agree with this summary. I think there are kind of two questions
   here:
   1. Technically does alignment/reliance on Kafka make sense
   2. Branding wise (naming, website, concepts, etc) does alignment with
  Kafka
   make sense
  
   Personally I do think both of these things would be really valuable,
 and
   would dramatically alter the trajectory of the project.
  
   My preference would be to see if people can mostly agree on a direction
   rather than splintering things off. From my point of view the ideal
  outcome
   of all the options discussed would be to make Samza a closely aligned
   subproject, maintained in a separate repository and retaining the
  existing
   committership but sharing as much else as possible (website, etc). No
  idea
   about how these things work, Jacob, you probably know more.
  
   No discussion amongst the Kafka folks has happened on this, but likely
 we
   should figure out what the Samza community actually wants first.
  
   I admit that this is a fairly radical departure from how things are.
  
   If that doesn't fly, I think, yeah we could leave Samza as it is and do
  the
   more radical reboot inside Kafka. From my point of view that does leave
   things in a somewhat confusing state since now there are two stream
   processing systems more or less coupled to Kafka in large part made by
  the
   same people. But, arguably that might be a cleaner way to make the
  cut-over
   and perhaps less risky for Samza community since if it works people can
   switch and if it doesn't nothing will 

Review Request 36405: SAMZA-714: Doc publish for 0.9.1 release

2015-07-10 Thread Yi Pan (Data Infrastructure)

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36405/
---

Review request for samza, Yan Fang, Chinmay Soman, Chris Riccomini, Navina 
Ramesh, and Naveen Somasundaram.


Repository: samza


Description
---

SAMZA-714: Doc publish for 0.9.1 release in master branch


Diffs
-

  docs/README.md 63f03e7933718581df45514d5b8f872c77897351 
  docs/startup/download/index.md ccaf20b4b4ee4d5885fbc0906d37470ec8a31e6e 

Diff: https://reviews.apache.org/r/36405/diff/


Testing
---


Thanks,

Yi Pan (Data Infrastructure)



Re: Thoughts and obesrvations on Samza

2015-07-10 Thread Julian Hyde
I broadly support it, with one big proviso.

One of the attractive things about Kafka has been its minimalism --
the fact that it solves one part of the problem, simply, and very
well. It is very important that it continues to do that, and that
people continue to perceive it that way. Make Kafka into a stack, if
you must, but make sure that the architectural layers remain clear.

Making perfectly good software components into a muddled stack is one
of the pitfalls of success, and in particular a pitfall of being a
software company with a sales team looking for a bigger pieces to sell
and customers looking for simpler large components to install. Avoid
at all costs!


On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps j...@confluent.io wrote:
 Yeah I agree with this summary. I think there are kind of two questions
 here:
 1. Technically does alignment/reliance on Kafka make sense
 2. Branding wise (naming, website, concepts, etc) does alignment with Kafka
 make sense

 Personally I do think both of these things would be really valuable, and
 would dramatically alter the trajectory of the project.

 My preference would be to see if people can mostly agree on a direction
 rather than splintering things off. From my point of view the ideal outcome
 of all the options discussed would be to make Samza a closely aligned
 subproject, maintained in a separate repository and retaining the existing
 committership but sharing as much else as possible (website, etc). No idea
 about how these things work, Jacob, you probably know more.

 No discussion amongst the Kafka folks has happened on this, but likely we
 should figure out what the Samza community actually wants first.

 I admit that this is a fairly radical departure from how things are.

 If that doesn't fly, I think, yeah we could leave Samza as it is and do the
 more radical reboot inside Kafka. From my point of view that does leave
 things in a somewhat confusing state since now there are two stream
 processing systems more or less coupled to Kafka in large part made by the
 same people. But, arguably that might be a cleaner way to make the cut-over
 and perhaps less risky for Samza community since if it works people can
 switch and if it doesn't nothing will have changed. Dunno, how do people
 feel about this?

 -Jay

 On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan jgho...@gmail.com wrote:

   This leads me to thinking that merging projects and communities might
 be a good idea: with the union of experience from both communities, we will
 probably build a better system that is better for users.
 Is this what's being proposed though? Merging the projects seems like
 a consequence of at most one of the three directions under discussion:
 1) Samza 2.0: The Samza community relies more heavily on Kafka for
 configuration, etc. (to a greater or lesser extent to be determined)
 but the Samza community would not automatically merge withe Kafka
 community (the Phoenix/HBase example is a good one here).
 2) Samza Reboot: The Samza community continues to exist with a limited
 project scope, but similarly would not need to be part of the Kafka
 community (ie given committership) to progress.  Here, maybe the Samza
 team would become a subproject of Kafka (the Board frowns on
 subprojects at the moment, so I'm not sure if that's even feasible),
 but that would not be required.
 3) Hey Samza! FYI, Kafka does streaming now: In this option the Kafka
 team builds its own streaming library, possibly off of Jay's
 prototype, which has not direct lineage to the Samza team.  There's no
 reason for the Kafka team to bring in the Samza team.

 Is the Kafka community on board with this?

 To be clear, all three options under discussion are interesting,
 technically valid and likely healthy directions for the project.
 Also, they are not mutually exclusive.  The Samza community could
 decide to pursue, say, 'Samza 2.0', while the Kafka community went
 forward with 'Hey Samza!'  My points above are directed entirely at
 the community aspect of these choices.
 -Jakob

 On 10 July 2015 at 09:10, Roger Hoover roger.hoo...@gmail.com wrote:
  That's great.  Thanks, Jay.
 
  On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps j...@confluent.io wrote:
 
  Yeah totally agree. I think you have this issue even today, right? I.e.
 if
  you need to make a simple config change and you're running in YARN today
  you end up bouncing the job which then rebuilds state. I think the fix
 is
  exactly what you described which is to have a long timeout on partition
  movement for stateful jobs so that if a job is just getting bounced, and
  the cluster manager (or admin) is smart enough to restart it on the same
  host when possible, it can optimistically reuse any existing state it
 finds
  on disk (if it is valid).
 
  So in this model the charter of the CM is to place processes as
 stickily as
  possible and to restart or re-place failed processes. The charter of the
  partition management system is to control the 

Re: Question on newBlockingQueue in BlockingEnvelopeMap

2015-07-10 Thread Yan Fang
Hi Jae,

I think the messages are not lost, instead, they all go to one partition,
in your shared queue implementation.

If you check the code in BlockingEnvelopeMap line 123
https://github.com/apache/samza/blob/master/samza-api/src/main/java/org/apache/samza/util/BlockingEnvelopeMap.java#L123
,
it puts all the messages in the queue in one partition.

Thanks,

Fang, Yan
yanfang...@gmail.com

On Fri, Jul 10, 2015 at 12:36 PM, Bae, Jae Hyeon metac...@gmail.com wrote:

 Hi Samza devs and users

 I wrote customized Samza S3 consumer which downloads files from S3 and put
 messages in BlockedEnvelopeMap. It was straightforward because there's a
 nice example, filereader. I tried to a little optimize with
 newBlockingQueue() method because I guess that single queue shared could be
 fine because Samza container is single threaded. I added the following
 code:


 public S3Consumer(String systemName, Config config, MetricsRegistry
 registry) {
 queueSize = config.getInt(systems. + systemName + .queue.size,
 1);
 bucket = config.get(systems. + systemName + .bucket);
 prefix = config.get(systems. + systemName + .prefix);

 queue = new LinkedBlockingQueue(queueSize);

 recordCounter = registry.newCounter(this.getClass().getName(),
 processed_records);
 }

 @Override
 protected BlockingQueueIncomingMessageEnvelope newBlockingQueue() {
 return queue; // single queue
 }

 Unfortunately, I observed significant message loss with this
 implementation. I suspected its queue might have dropped messages, so I
 changed newBlockingQueue() implementation same as filereader.

 @Override
 protected BlockingQueueIncomingMessageEnvelope newBlockingQueue() {
 return new LinkedBlockingQueue(queueSize);
 }

 Then, message loss didn't happen again.

 Do you have any idea why it went wrong?

 Thank you
 Best, Jae



Re: Thoughts and obesrvations on Samza

2015-07-10 Thread Yan Fang
{quote}
maintained in a separate repository and retaining the existing
committership but sharing as much else as possible (website, etc)
{quote}

Overall, I agree on this idea. Now the question is more about how to do
it.

On the other hand, one thing I want to point out is that, if we decide to
go this way, how do we want to support
otherSystem-transformation-otherSystem use case?

Basically, there are four user groups here:

1. Kafka-transformation-Kafka
2. Kafka-transformation-otherSystem
3. otherSystem-transformation-Kafka
4. otherSystem-transformation-otherSystem

For group 1, they can easily use the new Samza library to achieve. For
group 2 and 3, they can use copyCat - transformation - Kafka or Kafka-
transformation - copyCat.

The problem is for group 4. Do we want to abandon this or still support it?
Of course, this use case can be achieved by using copyCat - transformation
- Kafka - transformation - copyCat, the thing is how we persuade them to
do this long chain. If yes, it will also be a win for Kafka too. Or if
there is no one in this community actually doing this so far, maybe ok to
not support the group 4 directly.

Thanks,

Fang, Yan
yanfang...@gmail.com

On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps j...@confluent.io wrote:

 Yeah I agree with this summary. I think there are kind of two questions
 here:
 1. Technically does alignment/reliance on Kafka make sense
 2. Branding wise (naming, website, concepts, etc) does alignment with Kafka
 make sense

 Personally I do think both of these things would be really valuable, and
 would dramatically alter the trajectory of the project.

 My preference would be to see if people can mostly agree on a direction
 rather than splintering things off. From my point of view the ideal outcome
 of all the options discussed would be to make Samza a closely aligned
 subproject, maintained in a separate repository and retaining the existing
 committership but sharing as much else as possible (website, etc). No idea
 about how these things work, Jacob, you probably know more.

 No discussion amongst the Kafka folks has happened on this, but likely we
 should figure out what the Samza community actually wants first.

 I admit that this is a fairly radical departure from how things are.

 If that doesn't fly, I think, yeah we could leave Samza as it is and do the
 more radical reboot inside Kafka. From my point of view that does leave
 things in a somewhat confusing state since now there are two stream
 processing systems more or less coupled to Kafka in large part made by the
 same people. But, arguably that might be a cleaner way to make the cut-over
 and perhaps less risky for Samza community since if it works people can
 switch and if it doesn't nothing will have changed. Dunno, how do people
 feel about this?

 -Jay

 On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan jgho...@gmail.com wrote:

This leads me to thinking that merging projects and communities might
  be a good idea: with the union of experience from both communities, we
 will
  probably build a better system that is better for users.
  Is this what's being proposed though? Merging the projects seems like
  a consequence of at most one of the three directions under discussion:
  1) Samza 2.0: The Samza community relies more heavily on Kafka for
  configuration, etc. (to a greater or lesser extent to be determined)
  but the Samza community would not automatically merge withe Kafka
  community (the Phoenix/HBase example is a good one here).
  2) Samza Reboot: The Samza community continues to exist with a limited
  project scope, but similarly would not need to be part of the Kafka
  community (ie given committership) to progress.  Here, maybe the Samza
  team would become a subproject of Kafka (the Board frowns on
  subprojects at the moment, so I'm not sure if that's even feasible),
  but that would not be required.
  3) Hey Samza! FYI, Kafka does streaming now: In this option the Kafka
  team builds its own streaming library, possibly off of Jay's
  prototype, which has not direct lineage to the Samza team.  There's no
  reason for the Kafka team to bring in the Samza team.
 
  Is the Kafka community on board with this?
 
  To be clear, all three options under discussion are interesting,
  technically valid and likely healthy directions for the project.
  Also, they are not mutually exclusive.  The Samza community could
  decide to pursue, say, 'Samza 2.0', while the Kafka community went
  forward with 'Hey Samza!'  My points above are directed entirely at
  the community aspect of these choices.
  -Jakob
 
  On 10 July 2015 at 09:10, Roger Hoover roger.hoo...@gmail.com wrote:
   That's great.  Thanks, Jay.
  
   On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps j...@confluent.io wrote:
  
   Yeah totally agree. I think you have this issue even today, right?
 I.e.
  if
   you need to make a simple config change and you're running in YARN
 today
   you end up bouncing the job which then rebuilds state. 

Re: Question on newBlockingQueue in BlockingEnvelopeMap

2015-07-10 Thread Bae, Jae Hyeon
I expected it should have not lost messages but it did. After I fixed
overridden method, it was fixed.

Anyway, thanks a lot for responding.

On Fri, Jul 10, 2015 at 2:11 PM, Yan Fang yanfang...@gmail.com wrote:

 Hi Jae,

 I think the messages are not lost, instead, they all go to one partition,
 in your shared queue implementation.

 If you check the code in BlockingEnvelopeMap line 123
 
 https://github.com/apache/samza/blob/master/samza-api/src/main/java/org/apache/samza/util/BlockingEnvelopeMap.java#L123
 
 ,
 it puts all the messages in the queue in one partition.

 Thanks,

 Fang, Yan
 yanfang...@gmail.com

 On Fri, Jul 10, 2015 at 12:36 PM, Bae, Jae Hyeon metac...@gmail.com
 wrote:

  Hi Samza devs and users
 
  I wrote customized Samza S3 consumer which downloads files from S3 and
 put
  messages in BlockedEnvelopeMap. It was straightforward because there's a
  nice example, filereader. I tried to a little optimize with
  newBlockingQueue() method because I guess that single queue shared could
 be
  fine because Samza container is single threaded. I added the following
  code:
 
 
  public S3Consumer(String systemName, Config config, MetricsRegistry
  registry) {
  queueSize = config.getInt(systems. + systemName +
 .queue.size,
  1);
  bucket = config.get(systems. + systemName + .bucket);
  prefix = config.get(systems. + systemName + .prefix);
 
  queue = new LinkedBlockingQueue(queueSize);
 
  recordCounter = registry.newCounter(this.getClass().getName(),
  processed_records);
  }
 
  @Override
  protected BlockingQueueIncomingMessageEnvelope newBlockingQueue() {
  return queue; // single queue
  }
 
  Unfortunately, I observed significant message loss with this
  implementation. I suspected its queue might have dropped messages, so I
  changed newBlockingQueue() implementation same as filereader.
 
  @Override
  protected BlockingQueueIncomingMessageEnvelope newBlockingQueue() {
  return new LinkedBlockingQueue(queueSize);
  }
 
  Then, message loss didn't happen again.
 
  Do you have any idea why it went wrong?
 
  Thank you
  Best, Jae