Re: Thoughts and obesrvations on Samza
Yeah totally agree. I think you have this issue even today, right? I.e. if you need to make a simple config change and you're running in YARN today you end up bouncing the job which then rebuilds state. I think the fix is exactly what you described which is to have a long timeout on partition movement for stateful jobs so that if a job is just getting bounced, and the cluster manager (or admin) is smart enough to restart it on the same host when possible, it can optimistically reuse any existing state it finds on disk (if it is valid). So in this model the charter of the CM is to place processes as stickily as possible and to restart or re-place failed processes. The charter of the partition management system is to control the assignment of work to these processes. The nice thing about this is that the work assignment, timeouts, behavior, configs, and code will all be the same across all cluster managers. So I think that prototype would actually give you exactly what you want today for any cluster manager (or manual placement + restart script) that was sticky in terms of host placement since there is already a configurable partition movement timeout and task-by-task state reuse with a check on state validity. -Jay On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover roger.hoo...@gmail.com wrote: That would be great to let Kafka do as much heavy lifting as possible and make it easier for other languages to implement Samza apis. One thing to watch out for is the interplay between Kafka's group management and the external scheduler/process manager's fault tolerance. If a container dies, the Kafka group membership protocol will try to assign it's tasks to other containers while at the same time the process manager is trying to relaunch the container. Without some consideration for this (like a configurable amount of time to wait before Kafka alters the group membership), there may be thrashing going on which is especially bad for containers with large amounts of local state. Someone else pointed this out already but I thought it might be worth calling out again. Cheers, Roger On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps j...@confluent.io wrote: Hey Roger, I couldn't agree more. We spent a bunch of time talking to people and that is exactly the stuff we heard time and again. What makes it hard, of course, is that there is some tension between compatibility with what's there now and making things better for new users. I also strongly agree with the importance of multi-language support. We are talking now about Java, but for application development use cases people want to work in whatever language they are using elsewhere. I think moving to a model where Kafka itself does the group membership, lifecycle control, and partition assignment has the advantage of putting all that complex stuff behind a clean api that the clients are already going to be implementing for their consumer, so the added functionality for stream processing beyond a consumer becomes very minor. -Jay On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover roger.hoo...@gmail.com wrote: Metamorphosis...nice. :) This has been a great discussion. As a user of Samza who's recently integrated it into a relatively large organization, I just want to add support to a few points already made. The biggest hurdles to adoption of Samza as it currently exists that I've experienced are: 1) YARN - YARN is overly complex in many environments where Puppet would do just fine but it was the only mechanism to get fault tolerance. 2) Configuration - I think I like the idea of configuring most of the job in code rather than config files. In general, I think the goal should be to make it harder to make mistakes, especially of the kind where the code expects something and the config doesn't match. The current config is quite intricate and error-prone. For example, the application logic may depend on bootstrapping a topic but rather than asserting that in the code, you have to rely on getting the config right. Likewise with serdes, the Java representations produced by various serdes (JSON, Avro, etc.) are not equivalent so you cannot just reconfigure a serde without changing the code. It would be nice for jobs to be able to assert what they expect from their input topics in terms of partitioning. This is getting a little off topic but I was even thinking about creating a Samza config linter that would sanity check a set of configs. Especially in organizations where config is managed by a different team than the application developer, it's very hard to get avoid config mistakes. 3) Java/Scala centric - for many teams (especially DevOps-type folks), the pain of the Java toolchain (maven, slow builds, weak command line support, configuration over convention) really inhibits productivity. As more and
Re: Thoughts and obesrvations on Samza
That would be great to let Kafka do as much heavy lifting as possible and make it easier for other languages to implement Samza apis. One thing to watch out for is the interplay between Kafka's group management and the external scheduler/process manager's fault tolerance. If a container dies, the Kafka group membership protocol will try to assign it's tasks to other containers while at the same time the process manager is trying to relaunch the container. Without some consideration for this (like a configurable amount of time to wait before Kafka alters the group membership), there may be thrashing going on which is especially bad for containers with large amounts of local state. Someone else pointed this out already but I thought it might be worth calling out again. Cheers, Roger On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps j...@confluent.io wrote: Hey Roger, I couldn't agree more. We spent a bunch of time talking to people and that is exactly the stuff we heard time and again. What makes it hard, of course, is that there is some tension between compatibility with what's there now and making things better for new users. I also strongly agree with the importance of multi-language support. We are talking now about Java, but for application development use cases people want to work in whatever language they are using elsewhere. I think moving to a model where Kafka itself does the group membership, lifecycle control, and partition assignment has the advantage of putting all that complex stuff behind a clean api that the clients are already going to be implementing for their consumer, so the added functionality for stream processing beyond a consumer becomes very minor. -Jay On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover roger.hoo...@gmail.com wrote: Metamorphosis...nice. :) This has been a great discussion. As a user of Samza who's recently integrated it into a relatively large organization, I just want to add support to a few points already made. The biggest hurdles to adoption of Samza as it currently exists that I've experienced are: 1) YARN - YARN is overly complex in many environments where Puppet would do just fine but it was the only mechanism to get fault tolerance. 2) Configuration - I think I like the idea of configuring most of the job in code rather than config files. In general, I think the goal should be to make it harder to make mistakes, especially of the kind where the code expects something and the config doesn't match. The current config is quite intricate and error-prone. For example, the application logic may depend on bootstrapping a topic but rather than asserting that in the code, you have to rely on getting the config right. Likewise with serdes, the Java representations produced by various serdes (JSON, Avro, etc.) are not equivalent so you cannot just reconfigure a serde without changing the code. It would be nice for jobs to be able to assert what they expect from their input topics in terms of partitioning. This is getting a little off topic but I was even thinking about creating a Samza config linter that would sanity check a set of configs. Especially in organizations where config is managed by a different team than the application developer, it's very hard to get avoid config mistakes. 3) Java/Scala centric - for many teams (especially DevOps-type folks), the pain of the Java toolchain (maven, slow builds, weak command line support, configuration over convention) really inhibits productivity. As more and more high-quality clients become available for Kafka, I hope they'll follow Samza's model. Not sure how much it affects the proposals in this thread but please consider other languages in the ecosystem as well. From what I've heard, Spark has more Python users than Java/Scala. (FYI, we added a Jython wrapper for the Samza API https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza and are working on a Yeoman generator https://github.com/Quantiply/generator-rico for Jython/Samza projects to alleviate some of the pain) I also want to underscore Jay's point about improving the user experience. That's a very important factor for adoption. I think the goal should be to make Samza as easy to get started with as something like Logstash. Logstash is vastly inferior in terms of capabilities to Samza but it's easy to get started and that makes a big difference. Cheers, Roger On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De Francisci Morales g...@apache.org wrote: Forgot to add. On the naming issues, Kafka Metamorphosis is a clear winner :) -- Gianmarco On 7 July 2015 at 13:26, Gianmarco De Francisci Morales g...@apache.org wrote: Hi, @Martin, thanks for you comments. Maybe I'm missing some important point, but I think coupling the releases is
Re: Samza job on YARN stuck Unassigned
Hi Krzysztof, I had connectivity errors, but in my case was the /etc/hosts misconfigured. Cheers. 2015-07-10 12:11 GMT-03:00 Roger Hoover roger.hoo...@gmail.com: Hi Krzysztof, I haven't seen that error before. It does sound like it could be a connection issue. Did you check that the YARN node has access to hdfs:///user/samza/deploy/event-log-etl-nested-0.1.0-dist.tar.gz? One way to set the AM and containers to debug is to include a log4j.xml file in your tar.gz on the lib folder. There special logic in the start scripts ( https://github.com/apache/samza/blob/master/samza-shell/src/main/bash/run-class.sh#L40 ) that checks for that path and doesn't work with log4j.properties, for example. Cheers, Roger On Fri, Jul 10, 2015 at 4:18 AM, Krzysztof Zarzycki k.zarzy...@gmail.com wrote: Hi there Samza developers, I have a problem that I cannot overcome with deploying Samza task on YARN. When I submitted the task, ApplicationMasters get created (2 of them), job is visible, but in state UNASSIGNED. After some time the job FAILED. application information on resource manager panel is : State: FAILED FinalStatus: FAILED Elapsed: 25mins, 2sec Diagnostics: Application application_1424354741837_0380 failed 2 times due to ApplicationMaster for attempt appattempt_1424354741837_0380_02 timed out. Failing the application. When I look into the logs of ApplicationMaster I see no errors, no warnings, anything wrong: Please see the output of yarn logs comand attached. My guess would be that connection failed between some components (container to ApplicationMaster? NodeManager? ). I suspect that when looking at jstack output in the AM: main #1 prio=5 os_prio=0 tid=0x7f9338015000 nid=0x6f2f waiting on condition [0x7f933de6e000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.util.ThreadUtil.sleepAtLeastIgnoreInterrupts(ThreadUtil.java:43) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:154) at com.sun.proxy.$Proxy18.registerApplicationMaster(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.registerApplicationMaster(AMRMClientImpl.java:196) at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl.registerApplicationMaster(AMRMClientAsyncImpl.java:138) at org.apache.samza.job.yarn.SamzaAppMasterLifecycle.onInit(SamzaAppMasterLifecycle.scala:39) at org.apache.samza.job.yarn.SamzaAppMaster$$anonfun$run$1.apply(SamzaAppMaster.scala:108) at org.apache.samza.job.yarn.SamzaAppMaster$$anonfun$run$1.apply(SamzaAppMaster.scala:108) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.samza.job.yarn.SamzaAppMaster$.run(SamzaAppMaster.scala:108) at org.apache.samza.job.yarn.SamzaAppMaster$.main(SamzaAppMaster.scala:95) at org.apache.samza.job.yarn.SamzaAppMaster.main(SamzaAppMaster.scala) On the other hand I see in logs correct RM addresses: 15/07/10 12:17:30 INFO client.RMProxy: Connecting to ResourceManager at hdnn02.company.com/148.251.82.11:8030 15/07/10 12:17:31 INFO client.RMProxy: Connecting to ResourceManager at hdnn02.company.com/148.251.82.11:8050 ... 2015-07-10 12:17:31,032 [main] INFO o.apache.samza.job.yarn.ClientHelper - trying to connect to RM hdnn02.company.com:8050 ... 2015-07-10 12:17:31,680 [main] INFO o.a.s.job.yarn.SamzaAppMasterService - Webapp is started at (rpc http://78.46.56.88:43268/, tracking http:// Does anyone knows what could be wrong here? I'll be grateful for any help, also in just debugging the case. I start with a simple question: do you know how to set log4j for AM containers to DEBUG? Thank you! Krzysztof
Review Request 36398: SAMZA-714: update release doc in 0.9.1
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36398/ --- Review request for samza, Yan Fang, Chinmay Soman, Chris Riccomini, and Navina Ramesh. Bugs: SAMZA-714 https://issues.apache.org/jira/browse/SAMZA-714 Repository: samza Description --- SAMZA-714: update release doc in 0.9.1 Diffs - docs/startup/hello-samza/versioned/index.md 507ceb51e239fb873ee210d909d427ffd21955e3 Diff: https://reviews.apache.org/r/36398/diff/ Testing --- Verified w/ RB #36384 on samza-hello-samza Thanks, Yi Pan (Data Infrastructure)
Re: Thoughts and obesrvations on Samza
This leads me to thinking that merging projects and communities might be a good idea: with the union of experience from both communities, we will probably build a better system that is better for users. Is this what's being proposed though? Merging the projects seems like a consequence of at most one of the three directions under discussion: 1) Samza 2.0: The Samza community relies more heavily on Kafka for configuration, etc. (to a greater or lesser extent to be determined) but the Samza community would not automatically merge withe Kafka community (the Phoenix/HBase example is a good one here). 2) Samza Reboot: The Samza community continues to exist with a limited project scope, but similarly would not need to be part of the Kafka community (ie given committership) to progress. Here, maybe the Samza team would become a subproject of Kafka (the Board frowns on subprojects at the moment, so I'm not sure if that's even feasible), but that would not be required. 3) Hey Samza! FYI, Kafka does streaming now: In this option the Kafka team builds its own streaming library, possibly off of Jay's prototype, which has not direct lineage to the Samza team. There's no reason for the Kafka team to bring in the Samza team. Is the Kafka community on board with this? To be clear, all three options under discussion are interesting, technically valid and likely healthy directions for the project. Also, they are not mutually exclusive. The Samza community could decide to pursue, say, 'Samza 2.0', while the Kafka community went forward with 'Hey Samza!' My points above are directed entirely at the community aspect of these choices. -Jakob On 10 July 2015 at 09:10, Roger Hoover roger.hoo...@gmail.com wrote: That's great. Thanks, Jay. On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps j...@confluent.io wrote: Yeah totally agree. I think you have this issue even today, right? I.e. if you need to make a simple config change and you're running in YARN today you end up bouncing the job which then rebuilds state. I think the fix is exactly what you described which is to have a long timeout on partition movement for stateful jobs so that if a job is just getting bounced, and the cluster manager (or admin) is smart enough to restart it on the same host when possible, it can optimistically reuse any existing state it finds on disk (if it is valid). So in this model the charter of the CM is to place processes as stickily as possible and to restart or re-place failed processes. The charter of the partition management system is to control the assignment of work to these processes. The nice thing about this is that the work assignment, timeouts, behavior, configs, and code will all be the same across all cluster managers. So I think that prototype would actually give you exactly what you want today for any cluster manager (or manual placement + restart script) that was sticky in terms of host placement since there is already a configurable partition movement timeout and task-by-task state reuse with a check on state validity. -Jay On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover roger.hoo...@gmail.com wrote: That would be great to let Kafka do as much heavy lifting as possible and make it easier for other languages to implement Samza apis. One thing to watch out for is the interplay between Kafka's group management and the external scheduler/process manager's fault tolerance. If a container dies, the Kafka group membership protocol will try to assign it's tasks to other containers while at the same time the process manager is trying to relaunch the container. Without some consideration for this (like a configurable amount of time to wait before Kafka alters the group membership), there may be thrashing going on which is especially bad for containers with large amounts of local state. Someone else pointed this out already but I thought it might be worth calling out again. Cheers, Roger On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps j...@confluent.io wrote: Hey Roger, I couldn't agree more. We spent a bunch of time talking to people and that is exactly the stuff we heard time and again. What makes it hard, of course, is that there is some tension between compatibility with what's there now and making things better for new users. I also strongly agree with the importance of multi-language support. We are talking now about Java, but for application development use cases people want to work in whatever language they are using elsewhere. I think moving to a model where Kafka itself does the group membership, lifecycle control, and partition assignment has the advantage of putting all that complex stuff behind a clean api that the clients are already going to be implementing for their consumer, so the added functionality for stream processing beyond a consumer becomes very minor. -Jay On Tue, Jul 7, 2015
Re: Review Request 35397: Fix SAMZA-697
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/35397/ --- (Updated July 10, 2015, 4:54 p.m.) Review request for samza. Summary (updated) - Fix SAMZA-697 Bugs: SAMZA-697 https://issues.apache.org/jira/browse/SAMZA-697 Repository: samza Description (updated) --- Add missing files Diffs (updated) - checkstyle/import-control.xml 3374f0c432e61ac4cda275377604cfd481f0cddf docs/learn/documentation/versioned/jobs/configuration-table.html 405e2cea4fd1d037cc26b3537f6bb406eded202b samza-core/src/main/java/org/apache/samza/task/TaskClassLoader.java PRE-CREATION samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala 0b3a235b5ab1d6bd60669bfe6023f6b0b4e943d3 samza-core/src/main/scala/org/apache/samza/container/SamzaContainer.scala cbacd183420e9d1d72b05693b55a8f0a62d59fc5 samza-core/src/main/scala/org/apache/samza/container/TaskInstance.scala c5a5ea5dea9a950fc741625238f5bf8b1f362180 samza-core/src/main/scala/org/apache/samza/job/JobRunner.scala 1c178a661e449c6bdfc4ce431aef9bb2d261a6c2 samza-core/src/main/scala/org/apache/samza/job/local/ProcessJobFactory.scala 4fac154709d72ab594485dad93c912b55fb1617e samza-core/src/test/java/org/apache/samza/task/TestTaskClassLoader.java PRE-CREATION samza-core/src/test/scala/org/apache/samza/container/TestSamzaContainer.scala 9fb1aa98fcd14397e8a4cb00c67537482e95fa53 samza-core/src/test/scala/org/apache/samza/container/TestTaskInstance.scala 7caad28c9298485753ab861da76793cf925953ed Diff: https://reviews.apache.org/r/35397/diff/ Testing --- unit tests Thanks, Guozhang Wang
Re: Thoughts and obesrvations on Samza
That's great. Thanks, Jay. On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps j...@confluent.io wrote: Yeah totally agree. I think you have this issue even today, right? I.e. if you need to make a simple config change and you're running in YARN today you end up bouncing the job which then rebuilds state. I think the fix is exactly what you described which is to have a long timeout on partition movement for stateful jobs so that if a job is just getting bounced, and the cluster manager (or admin) is smart enough to restart it on the same host when possible, it can optimistically reuse any existing state it finds on disk (if it is valid). So in this model the charter of the CM is to place processes as stickily as possible and to restart or re-place failed processes. The charter of the partition management system is to control the assignment of work to these processes. The nice thing about this is that the work assignment, timeouts, behavior, configs, and code will all be the same across all cluster managers. So I think that prototype would actually give you exactly what you want today for any cluster manager (or manual placement + restart script) that was sticky in terms of host placement since there is already a configurable partition movement timeout and task-by-task state reuse with a check on state validity. -Jay On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover roger.hoo...@gmail.com wrote: That would be great to let Kafka do as much heavy lifting as possible and make it easier for other languages to implement Samza apis. One thing to watch out for is the interplay between Kafka's group management and the external scheduler/process manager's fault tolerance. If a container dies, the Kafka group membership protocol will try to assign it's tasks to other containers while at the same time the process manager is trying to relaunch the container. Without some consideration for this (like a configurable amount of time to wait before Kafka alters the group membership), there may be thrashing going on which is especially bad for containers with large amounts of local state. Someone else pointed this out already but I thought it might be worth calling out again. Cheers, Roger On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps j...@confluent.io wrote: Hey Roger, I couldn't agree more. We spent a bunch of time talking to people and that is exactly the stuff we heard time and again. What makes it hard, of course, is that there is some tension between compatibility with what's there now and making things better for new users. I also strongly agree with the importance of multi-language support. We are talking now about Java, but for application development use cases people want to work in whatever language they are using elsewhere. I think moving to a model where Kafka itself does the group membership, lifecycle control, and partition assignment has the advantage of putting all that complex stuff behind a clean api that the clients are already going to be implementing for their consumer, so the added functionality for stream processing beyond a consumer becomes very minor. -Jay On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover roger.hoo...@gmail.com wrote: Metamorphosis...nice. :) This has been a great discussion. As a user of Samza who's recently integrated it into a relatively large organization, I just want to add support to a few points already made. The biggest hurdles to adoption of Samza as it currently exists that I've experienced are: 1) YARN - YARN is overly complex in many environments where Puppet would do just fine but it was the only mechanism to get fault tolerance. 2) Configuration - I think I like the idea of configuring most of the job in code rather than config files. In general, I think the goal should be to make it harder to make mistakes, especially of the kind where the code expects something and the config doesn't match. The current config is quite intricate and error-prone. For example, the application logic may depend on bootstrapping a topic but rather than asserting that in the code, you have to rely on getting the config right. Likewise with serdes, the Java representations produced by various serdes (JSON, Avro, etc.) are not equivalent so you cannot just reconfigure a serde without changing the code. It would be nice for jobs to be able to assert what they expect from their input topics in terms of partitioning. This is getting a little off topic but I was even thinking about creating a Samza config linter that would sanity check a set of configs. Especially in organizations where config is managed by a different team than the application developer, it's very hard to get avoid config mistakes. 3) Java/Scala centric
Re: Thoughts and obesrvations on Samza
Yeah I agree with this summary. I think there are kind of two questions here: 1. Technically does alignment/reliance on Kafka make sense 2. Branding wise (naming, website, concepts, etc) does alignment with Kafka make sense Personally I do think both of these things would be really valuable, and would dramatically alter the trajectory of the project. My preference would be to see if people can mostly agree on a direction rather than splintering things off. From my point of view the ideal outcome of all the options discussed would be to make Samza a closely aligned subproject, maintained in a separate repository and retaining the existing committership but sharing as much else as possible (website, etc). No idea about how these things work, Jacob, you probably know more. No discussion amongst the Kafka folks has happened on this, but likely we should figure out what the Samza community actually wants first. I admit that this is a fairly radical departure from how things are. If that doesn't fly, I think, yeah we could leave Samza as it is and do the more radical reboot inside Kafka. From my point of view that does leave things in a somewhat confusing state since now there are two stream processing systems more or less coupled to Kafka in large part made by the same people. But, arguably that might be a cleaner way to make the cut-over and perhaps less risky for Samza community since if it works people can switch and if it doesn't nothing will have changed. Dunno, how do people feel about this? -Jay On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan jgho...@gmail.com wrote: This leads me to thinking that merging projects and communities might be a good idea: with the union of experience from both communities, we will probably build a better system that is better for users. Is this what's being proposed though? Merging the projects seems like a consequence of at most one of the three directions under discussion: 1) Samza 2.0: The Samza community relies more heavily on Kafka for configuration, etc. (to a greater or lesser extent to be determined) but the Samza community would not automatically merge withe Kafka community (the Phoenix/HBase example is a good one here). 2) Samza Reboot: The Samza community continues to exist with a limited project scope, but similarly would not need to be part of the Kafka community (ie given committership) to progress. Here, maybe the Samza team would become a subproject of Kafka (the Board frowns on subprojects at the moment, so I'm not sure if that's even feasible), but that would not be required. 3) Hey Samza! FYI, Kafka does streaming now: In this option the Kafka team builds its own streaming library, possibly off of Jay's prototype, which has not direct lineage to the Samza team. There's no reason for the Kafka team to bring in the Samza team. Is the Kafka community on board with this? To be clear, all three options under discussion are interesting, technically valid and likely healthy directions for the project. Also, they are not mutually exclusive. The Samza community could decide to pursue, say, 'Samza 2.0', while the Kafka community went forward with 'Hey Samza!' My points above are directed entirely at the community aspect of these choices. -Jakob On 10 July 2015 at 09:10, Roger Hoover roger.hoo...@gmail.com wrote: That's great. Thanks, Jay. On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps j...@confluent.io wrote: Yeah totally agree. I think you have this issue even today, right? I.e. if you need to make a simple config change and you're running in YARN today you end up bouncing the job which then rebuilds state. I think the fix is exactly what you described which is to have a long timeout on partition movement for stateful jobs so that if a job is just getting bounced, and the cluster manager (or admin) is smart enough to restart it on the same host when possible, it can optimistically reuse any existing state it finds on disk (if it is valid). So in this model the charter of the CM is to place processes as stickily as possible and to restart or re-place failed processes. The charter of the partition management system is to control the assignment of work to these processes. The nice thing about this is that the work assignment, timeouts, behavior, configs, and code will all be the same across all cluster managers. So I think that prototype would actually give you exactly what you want today for any cluster manager (or manual placement + restart script) that was sticky in terms of host placement since there is already a configurable partition movement timeout and task-by-task state reuse with a check on state validity. -Jay On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover roger.hoo...@gmail.com wrote: That would be great to let Kafka do as much heavy lifting as possible and make it easier for other languages to implement Samza apis. One thing to watch
Re: Thoughts and obesrvations on Samza
Thanks, Jay. This argument persuaded me actually. :) Fang, Yan yanfang...@gmail.com On Fri, Jul 10, 2015 at 2:33 PM, Jay Kreps j...@confluent.io wrote: Hey Yan, Yeah philosophically I think the argument is that you should capture the stream in Kafka independent of the transformation. This is obviously a Kafka-centric view point. Advantages of this: - In practice I think this is what e.g. Storm people often end up doing anyway. You usually need to throttle any access to a live serving database. - Can have multiple subscribers and they get the same thing without additional load on the source system. - Applications can tap into the stream if need be by subscribing. - You can debug your transformation by tailing the Kafka topic with the console consumer - Can tee off the same data stream for batch analysis or Lambda arch style re-processing The disadvantage is that it will use Kafka resources. But the idea is eventually you will have multiple subscribers to any data source (at least for monitoring) so you will end up there soon enough anyway. Down the road the technical benefit is that I think it gives us a good path towards end-to-end exactly once semantics from source to destination. Basically the connectors need to support idempotence when talking to Kafka and we need the transactional write feature in Kafka to make the transformation atomic. This is actually pretty doable if you separate connector=kafka problem from the generic transformations which are always kafka=kafka. However I think it is quite impossible to do in a all_things = all_things environment. Today you can say well the semantics of the Samza APIs depend on the connectors you use but it is actually worse then that because the semantics actually depend on the pairing of connectors--so not only can you probably not get a usable exactly once guarantee end-to-end it can actually be quite hard to reverse engineer what property (if any) your end-to-end flow has if you have heterogenous systems. -Jay On Fri, Jul 10, 2015 at 2:00 PM, Yan Fang yanfang...@gmail.com wrote: {quote} maintained in a separate repository and retaining the existing committership but sharing as much else as possible (website, etc) {quote} Overall, I agree on this idea. Now the question is more about how to do it. On the other hand, one thing I want to point out is that, if we decide to go this way, how do we want to support otherSystem-transformation-otherSystem use case? Basically, there are four user groups here: 1. Kafka-transformation-Kafka 2. Kafka-transformation-otherSystem 3. otherSystem-transformation-Kafka 4. otherSystem-transformation-otherSystem For group 1, they can easily use the new Samza library to achieve. For group 2 and 3, they can use copyCat - transformation - Kafka or Kafka- transformation - copyCat. The problem is for group 4. Do we want to abandon this or still support it? Of course, this use case can be achieved by using copyCat - transformation - Kafka - transformation - copyCat, the thing is how we persuade them to do this long chain. If yes, it will also be a win for Kafka too. Or if there is no one in this community actually doing this so far, maybe ok to not support the group 4 directly. Thanks, Fang, Yan yanfang...@gmail.com On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps j...@confluent.io wrote: Yeah I agree with this summary. I think there are kind of two questions here: 1. Technically does alignment/reliance on Kafka make sense 2. Branding wise (naming, website, concepts, etc) does alignment with Kafka make sense Personally I do think both of these things would be really valuable, and would dramatically alter the trajectory of the project. My preference would be to see if people can mostly agree on a direction rather than splintering things off. From my point of view the ideal outcome of all the options discussed would be to make Samza a closely aligned subproject, maintained in a separate repository and retaining the existing committership but sharing as much else as possible (website, etc). No idea about how these things work, Jacob, you probably know more. No discussion amongst the Kafka folks has happened on this, but likely we should figure out what the Samza community actually wants first. I admit that this is a fairly radical departure from how things are. If that doesn't fly, I think, yeah we could leave Samza as it is and do the more radical reboot inside Kafka. From my point of view that does leave things in a somewhat confusing state since now there are two stream processing systems more or less coupled to Kafka in large part made by the same people. But, arguably that might be a cleaner way to make the cut-over and perhaps less risky for Samza community since if it works people can switch and if it doesn't nothing will
Review Request 36405: SAMZA-714: Doc publish for 0.9.1 release
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36405/ --- Review request for samza, Yan Fang, Chinmay Soman, Chris Riccomini, Navina Ramesh, and Naveen Somasundaram. Repository: samza Description --- SAMZA-714: Doc publish for 0.9.1 release in master branch Diffs - docs/README.md 63f03e7933718581df45514d5b8f872c77897351 docs/startup/download/index.md ccaf20b4b4ee4d5885fbc0906d37470ec8a31e6e Diff: https://reviews.apache.org/r/36405/diff/ Testing --- Thanks, Yi Pan (Data Infrastructure)
Re: Thoughts and obesrvations on Samza
I broadly support it, with one big proviso. One of the attractive things about Kafka has been its minimalism -- the fact that it solves one part of the problem, simply, and very well. It is very important that it continues to do that, and that people continue to perceive it that way. Make Kafka into a stack, if you must, but make sure that the architectural layers remain clear. Making perfectly good software components into a muddled stack is one of the pitfalls of success, and in particular a pitfall of being a software company with a sales team looking for a bigger pieces to sell and customers looking for simpler large components to install. Avoid at all costs! On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps j...@confluent.io wrote: Yeah I agree with this summary. I think there are kind of two questions here: 1. Technically does alignment/reliance on Kafka make sense 2. Branding wise (naming, website, concepts, etc) does alignment with Kafka make sense Personally I do think both of these things would be really valuable, and would dramatically alter the trajectory of the project. My preference would be to see if people can mostly agree on a direction rather than splintering things off. From my point of view the ideal outcome of all the options discussed would be to make Samza a closely aligned subproject, maintained in a separate repository and retaining the existing committership but sharing as much else as possible (website, etc). No idea about how these things work, Jacob, you probably know more. No discussion amongst the Kafka folks has happened on this, but likely we should figure out what the Samza community actually wants first. I admit that this is a fairly radical departure from how things are. If that doesn't fly, I think, yeah we could leave Samza as it is and do the more radical reboot inside Kafka. From my point of view that does leave things in a somewhat confusing state since now there are two stream processing systems more or less coupled to Kafka in large part made by the same people. But, arguably that might be a cleaner way to make the cut-over and perhaps less risky for Samza community since if it works people can switch and if it doesn't nothing will have changed. Dunno, how do people feel about this? -Jay On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan jgho...@gmail.com wrote: This leads me to thinking that merging projects and communities might be a good idea: with the union of experience from both communities, we will probably build a better system that is better for users. Is this what's being proposed though? Merging the projects seems like a consequence of at most one of the three directions under discussion: 1) Samza 2.0: The Samza community relies more heavily on Kafka for configuration, etc. (to a greater or lesser extent to be determined) but the Samza community would not automatically merge withe Kafka community (the Phoenix/HBase example is a good one here). 2) Samza Reboot: The Samza community continues to exist with a limited project scope, but similarly would not need to be part of the Kafka community (ie given committership) to progress. Here, maybe the Samza team would become a subproject of Kafka (the Board frowns on subprojects at the moment, so I'm not sure if that's even feasible), but that would not be required. 3) Hey Samza! FYI, Kafka does streaming now: In this option the Kafka team builds its own streaming library, possibly off of Jay's prototype, which has not direct lineage to the Samza team. There's no reason for the Kafka team to bring in the Samza team. Is the Kafka community on board with this? To be clear, all three options under discussion are interesting, technically valid and likely healthy directions for the project. Also, they are not mutually exclusive. The Samza community could decide to pursue, say, 'Samza 2.0', while the Kafka community went forward with 'Hey Samza!' My points above are directed entirely at the community aspect of these choices. -Jakob On 10 July 2015 at 09:10, Roger Hoover roger.hoo...@gmail.com wrote: That's great. Thanks, Jay. On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps j...@confluent.io wrote: Yeah totally agree. I think you have this issue even today, right? I.e. if you need to make a simple config change and you're running in YARN today you end up bouncing the job which then rebuilds state. I think the fix is exactly what you described which is to have a long timeout on partition movement for stateful jobs so that if a job is just getting bounced, and the cluster manager (or admin) is smart enough to restart it on the same host when possible, it can optimistically reuse any existing state it finds on disk (if it is valid). So in this model the charter of the CM is to place processes as stickily as possible and to restart or re-place failed processes. The charter of the partition management system is to control the
Re: Question on newBlockingQueue in BlockingEnvelopeMap
Hi Jae, I think the messages are not lost, instead, they all go to one partition, in your shared queue implementation. If you check the code in BlockingEnvelopeMap line 123 https://github.com/apache/samza/blob/master/samza-api/src/main/java/org/apache/samza/util/BlockingEnvelopeMap.java#L123 , it puts all the messages in the queue in one partition. Thanks, Fang, Yan yanfang...@gmail.com On Fri, Jul 10, 2015 at 12:36 PM, Bae, Jae Hyeon metac...@gmail.com wrote: Hi Samza devs and users I wrote customized Samza S3 consumer which downloads files from S3 and put messages in BlockedEnvelopeMap. It was straightforward because there's a nice example, filereader. I tried to a little optimize with newBlockingQueue() method because I guess that single queue shared could be fine because Samza container is single threaded. I added the following code: public S3Consumer(String systemName, Config config, MetricsRegistry registry) { queueSize = config.getInt(systems. + systemName + .queue.size, 1); bucket = config.get(systems. + systemName + .bucket); prefix = config.get(systems. + systemName + .prefix); queue = new LinkedBlockingQueue(queueSize); recordCounter = registry.newCounter(this.getClass().getName(), processed_records); } @Override protected BlockingQueueIncomingMessageEnvelope newBlockingQueue() { return queue; // single queue } Unfortunately, I observed significant message loss with this implementation. I suspected its queue might have dropped messages, so I changed newBlockingQueue() implementation same as filereader. @Override protected BlockingQueueIncomingMessageEnvelope newBlockingQueue() { return new LinkedBlockingQueue(queueSize); } Then, message loss didn't happen again. Do you have any idea why it went wrong? Thank you Best, Jae
Re: Thoughts and obesrvations on Samza
{quote} maintained in a separate repository and retaining the existing committership but sharing as much else as possible (website, etc) {quote} Overall, I agree on this idea. Now the question is more about how to do it. On the other hand, one thing I want to point out is that, if we decide to go this way, how do we want to support otherSystem-transformation-otherSystem use case? Basically, there are four user groups here: 1. Kafka-transformation-Kafka 2. Kafka-transformation-otherSystem 3. otherSystem-transformation-Kafka 4. otherSystem-transformation-otherSystem For group 1, they can easily use the new Samza library to achieve. For group 2 and 3, they can use copyCat - transformation - Kafka or Kafka- transformation - copyCat. The problem is for group 4. Do we want to abandon this or still support it? Of course, this use case can be achieved by using copyCat - transformation - Kafka - transformation - copyCat, the thing is how we persuade them to do this long chain. If yes, it will also be a win for Kafka too. Or if there is no one in this community actually doing this so far, maybe ok to not support the group 4 directly. Thanks, Fang, Yan yanfang...@gmail.com On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps j...@confluent.io wrote: Yeah I agree with this summary. I think there are kind of two questions here: 1. Technically does alignment/reliance on Kafka make sense 2. Branding wise (naming, website, concepts, etc) does alignment with Kafka make sense Personally I do think both of these things would be really valuable, and would dramatically alter the trajectory of the project. My preference would be to see if people can mostly agree on a direction rather than splintering things off. From my point of view the ideal outcome of all the options discussed would be to make Samza a closely aligned subproject, maintained in a separate repository and retaining the existing committership but sharing as much else as possible (website, etc). No idea about how these things work, Jacob, you probably know more. No discussion amongst the Kafka folks has happened on this, but likely we should figure out what the Samza community actually wants first. I admit that this is a fairly radical departure from how things are. If that doesn't fly, I think, yeah we could leave Samza as it is and do the more radical reboot inside Kafka. From my point of view that does leave things in a somewhat confusing state since now there are two stream processing systems more or less coupled to Kafka in large part made by the same people. But, arguably that might be a cleaner way to make the cut-over and perhaps less risky for Samza community since if it works people can switch and if it doesn't nothing will have changed. Dunno, how do people feel about this? -Jay On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan jgho...@gmail.com wrote: This leads me to thinking that merging projects and communities might be a good idea: with the union of experience from both communities, we will probably build a better system that is better for users. Is this what's being proposed though? Merging the projects seems like a consequence of at most one of the three directions under discussion: 1) Samza 2.0: The Samza community relies more heavily on Kafka for configuration, etc. (to a greater or lesser extent to be determined) but the Samza community would not automatically merge withe Kafka community (the Phoenix/HBase example is a good one here). 2) Samza Reboot: The Samza community continues to exist with a limited project scope, but similarly would not need to be part of the Kafka community (ie given committership) to progress. Here, maybe the Samza team would become a subproject of Kafka (the Board frowns on subprojects at the moment, so I'm not sure if that's even feasible), but that would not be required. 3) Hey Samza! FYI, Kafka does streaming now: In this option the Kafka team builds its own streaming library, possibly off of Jay's prototype, which has not direct lineage to the Samza team. There's no reason for the Kafka team to bring in the Samza team. Is the Kafka community on board with this? To be clear, all three options under discussion are interesting, technically valid and likely healthy directions for the project. Also, they are not mutually exclusive. The Samza community could decide to pursue, say, 'Samza 2.0', while the Kafka community went forward with 'Hey Samza!' My points above are directed entirely at the community aspect of these choices. -Jakob On 10 July 2015 at 09:10, Roger Hoover roger.hoo...@gmail.com wrote: That's great. Thanks, Jay. On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps j...@confluent.io wrote: Yeah totally agree. I think you have this issue even today, right? I.e. if you need to make a simple config change and you're running in YARN today you end up bouncing the job which then rebuilds state.
Re: Question on newBlockingQueue in BlockingEnvelopeMap
I expected it should have not lost messages but it did. After I fixed overridden method, it was fixed. Anyway, thanks a lot for responding. On Fri, Jul 10, 2015 at 2:11 PM, Yan Fang yanfang...@gmail.com wrote: Hi Jae, I think the messages are not lost, instead, they all go to one partition, in your shared queue implementation. If you check the code in BlockingEnvelopeMap line 123 https://github.com/apache/samza/blob/master/samza-api/src/main/java/org/apache/samza/util/BlockingEnvelopeMap.java#L123 , it puts all the messages in the queue in one partition. Thanks, Fang, Yan yanfang...@gmail.com On Fri, Jul 10, 2015 at 12:36 PM, Bae, Jae Hyeon metac...@gmail.com wrote: Hi Samza devs and users I wrote customized Samza S3 consumer which downloads files from S3 and put messages in BlockedEnvelopeMap. It was straightforward because there's a nice example, filereader. I tried to a little optimize with newBlockingQueue() method because I guess that single queue shared could be fine because Samza container is single threaded. I added the following code: public S3Consumer(String systemName, Config config, MetricsRegistry registry) { queueSize = config.getInt(systems. + systemName + .queue.size, 1); bucket = config.get(systems. + systemName + .bucket); prefix = config.get(systems. + systemName + .prefix); queue = new LinkedBlockingQueue(queueSize); recordCounter = registry.newCounter(this.getClass().getName(), processed_records); } @Override protected BlockingQueueIncomingMessageEnvelope newBlockingQueue() { return queue; // single queue } Unfortunately, I observed significant message loss with this implementation. I suspected its queue might have dropped messages, so I changed newBlockingQueue() implementation same as filereader. @Override protected BlockingQueueIncomingMessageEnvelope newBlockingQueue() { return new LinkedBlockingQueue(queueSize); } Then, message loss didn't happen again. Do you have any idea why it went wrong? Thank you Best, Jae