Thanks Yi, that helps a lot. For 1) though, I was still using YarnJobFactory, just found that I was getting that error in the logs when I look at YARN, so does that mean it is reprocessing messages, and will the job start in the same way if I restart the grid?
Thanks again, Connie On Sun, Aug 30, 2015 at 10:26 PM, Yi Pan <nickpa...@gmail.com> wrote: > Hi, Connie, > > Let me clarify a bit further: > 1. ThreadJobFactory and ProcessJobFactory do not work w/ any cluster > management systems (i.e. YARN). Hence, when you say that you are > starting/stopping the grid and using ThreadJobFactory configuration, that > will not work. The only job factory that works w/ YARN is YarnJobFactory. > 2. The rat error is a Apache license check. Make sure that the source files > in your project has the Apache license header at the beginning > 3. Samza process the messages in a single-threaded mode. Two messages from > two different system stream partitions will be delivered to StreamTask in > sequence. Join usually means that you will need to buffer some amount of > messages in each stream in the KV-store. When a new message from stream A > comes, join task will need to lookup the relevant messages in the buffer of > other streams to yield the result. > > Best, > > -Yi > > On Sun, Aug 30, 2015 at 10:11 PM, Connie Chen <spiritgr...@gmail.com> > wrote: > > > Hi Yi, > > > > Thanks for your response. A few more questions: > > > > 1) What are the semantics of the kafka topics in hello-world? (I am using > > the offline version that just produces updates in a loop). As in, if I do > > "bin/grid stop all" will the jobs re-read the same messages from Kafka or > > are they reading new ones? I get an error like: > > > > 2015-08-30 21:55:27 KafkaSystemConsumer [WARN] While refreshing brokers > for > > > [__samza_coordinator_wikipedia-stats_1,0]: > > org.apache.samza.SamzaException: > > > Already consuming TopicPartition > > [__samza_coordinator_wikipedia-stats_1,0]. > > > Retrying. > > > > > > when I restart the grid and wondering if the behavior after the restart > > should be the same every time. > > > > 2) I am using ThreadJobFactory as you suggested for "job.factory.class" > in > > the stats task, but getting this error: > > > > [ERROR] Failed to execute goal org.apache.rat:apache-rat-plugin:0.9:check > > > (default) on project hello-samza: Too many files with unapproved > license: > > > > > > when I try to mvn package. > > > > 3) If I have two input streams, is the # of partitions configured from > > Kafka/ where can I set what partition the message goes to? Also, how > would > > I get the two messages from two streams in StreamTask? (seems like > > IncomingMessageEnvelope that process() provides only represents one > message > > from one stream, where would it take in the two messages sent to the same > > partition?) > > > > Thank you! > > > > Connie > > > > On Sun, Aug 30, 2015 at 8:34 PM, Yi Pan <nickpa...@gmail.com> wrote: > > > > > Hi, Connie, > > > > > > If I understand your trial example, you were trying to manually launch > > > Samza container via run-container.sh script. Unfortunately, this is > only > > > possible via ProcessJobFactory and ThreadJobFactory right now. Using > > YARN, > > > you will have to start the job via run-job.sh on the RM. Then, the > > > SamzaAppMaster will start the containers automatically. > > > > > > As for the join example that you are looking for, you should be able to > > > configure a job that takes two streams w/ the same number of partitions > > and > > > use the default job.systemstreampartition.grouper.factory. > > > Then, each of your task will take in the messages from two streams from > > the > > > same partition and your code implementing StreamTask should perform the > > > join logic. > > > > > > Hope that explains the procedure at high level. Feel free to ask if you > > > have further questions. > > > > > > Thanks! > > > > > > -Yi > > > > > > On Sun, Aug 30, 2015 at 5:57 PM, Connie Chen <spiritgr...@gmail.com> > > > wrote: > > > > > > > I am relatively new to Samza as well as YARN/Zookeeper/Kafka, I went > > > > through the hello-samza > > > > <http://samza.apache.org/startup/hello-samza/latest/>example and was > > > > wondering if there was more documentation/tutorials about > > SamzaContainer. > > > > > > > > There are some files under samza.examples.wikipedia.system in > > > hello-samza, > > > > it looks like they are using the container API (from reading here > > > > < > > > > > > > > > > http://samza.apache.org/learn/documentation/latest/container/samza-container.html > > > > > > > > > and here > > > > < > > > > > > http://samza.apache.org/learn/documentation/latest/container/streams.html > > > > >) > > > > and I can't figure out how to run them. > > > > > > > > I tried bin/run-container.sh but I get this number format exception: > > > > > > > > java.lang.NumberFormatException: null > > > > > at java.lang.Integer.parseInt(Integer.java:454) > > > > > at java.lang.Integer.parseInt(Integer.java:527) > > > > > at > > > > > scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) > > > > > at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) > > > > > at > > > > > > > > > > > > > > > org.apache.samza.container.SamzaContainer$.safeMain(SamzaContainer.scala:82) > > > > > at > > > > > > org.apache.samza.container.SamzaContainer$.main(SamzaContainer.scala:69) > > > > > at > > org.apache.samza.container.SamzaContainer.main(SamzaContainer.scala) > > > > > > > > > > > > Anyone have tips or suggestions for how I can play around with this > > more? > > > > Mainly, I would like to try partitioning and see an example of > joining > > > two > > > > streams with the embedded k-v store. > > > > > > > > Thanks in advance! > > > > > > > > Connie > > > > > > > > > >