Hi, thanks for the detailed answers. One more question, though: what do you mean by "over-partition"? Do you mean I would initially define, let's say, 100 partitions and then assign only 2 containers? When I need to scale out, I would just add more containers and Samza would then also redistribute the state stores? And what if I want to reduce the number of containers (e.g. when hosting on EC2), can Samza merge the state stores from several containers into one?
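
Just to make sure we are talking about the same thing, this is roughly the setup I have in mind (only a sketch; the topic name is made up, I took task.inputs from the docs, and yarn.container.count is the key you mentioned):

    # Kafka input topic created up front with 100 partitions
    task.inputs=kafka.my-input-topic
    # start with only 2 containers and raise this later when scaling out
    yarn.container.count=2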
Cheers,

Klaus


On Mon, Dec 9, 2013 at 6:40 PM, Chris Riccomini <[email protected]> wrote:

> Hi Klaus,
>
> Thanks for your interest. I'm happy to answer any questions you've got.
>
> 1) In case of failure, will Samza restore the state automatically on
> another node?
>
> Yes. If you are using YARN
> (job.factory.class=org.apache.samza.job.yarn.YarnJobFactory), then YARN
> will see that a Samza container has failed. It will restart the Samza
> container on a machine in the YARN grid. This might be the same node on
> the grid, or it might be a new node; YARN makes that decision based on
> the available resources in the grid. When the Samza container starts up,
> Samza will restore the container's state to where it was before the
> failure occurred. Once the state has been fully restored, the Samza
> container will start feeding your StreamTask new messages from the input
> streams.
>
> 2) If I want to scale out and increase the number of stream partitions,
> how is the local storage handled? Is it distributed by the framework as
> well, according to the partitions?
>
> Samza's partitioning model is currently determined by the number of
> partitions that a Samza job's input streams have. A Samza job will always
> have the max of the partition counts of all of its input streams. For
> example, if a Samza job has two input streams, A and B, and stream A has
> 4 partitions and stream B has 8, then the Samza job will have 8
> partitions. The first four partitions (0-3) of the Samza job will receive
> messages from both stream A and stream B, and the last four partitions
> (4-7) will receive messages ONLY from stream B. These partitions are run
> physically inside of Samza "containers". Samza containers are assigned
> the partitions that they are responsible for processing. For example, if
> you had two Samza containers, the first container would process 4
> partitions, and the second container would process the other four. With
> YARN, the number of containers you have is defined by the config
> yarn.container.count. Right now, you can never have more containers than
> input stream partitions.
>
> The state for each one of a Samza job's partitions is managed entirely
> independently. That is, each Samza partition has its own state store
> (LevelDB, if you're using samza-kv). So in the example above, if the
> StreamTask were using a single key-value store, there would be 8 LevelDB
> stores, one for each Samza partition. This allows us to move partitions
> between containers. If you were to decide you wanted 3 containers instead
> of 2, Samza would simply stop processing the partitions on the existing
> two containers, restore the state needed by the third container, and then
> begin processing the partitions across all three containers. This model
> means the maximum parallelism you can get is one container per partition,
> i.e. up to as many containers as your input streams have partitions.
>
> Samza does not support resizing an input stream's partition count right
> now. Once you start a Samza job, the partition counts for the input
> streams are assumed to be static. If you decide you need more
> parallelism, you need to start a new Samza job with a different job.name
> and re-process all the input data again.
>
> Cheers,
> Chris
>
> On 12/9/13 8:02 AM, "Klaus Schaefers" <[email protected]> wrote:
>
> >Hi,
> >
> >I have been reading about Samza and I like the concept behind it a lot.
> >In particular the local key-value store is a good idea.
> >However, I have some short questions regarding the local state that I
> >couldn't answer while reading the web page. I would be very happy if
> >someone could answer them shortly. Here they are:
> >
> >1) In case of failure, will Samza restore the state automatically on
> >another node?
> >
> >2) If I want to scale out and increase the number of stream partitions,
> >how is the local storage handled? Is it distributed by the framework as
> >well, according to the partitions?
> >
> >Cheers,
> >
> >Klaus

--

Klaus Schaefers
Senior Optimization Manager

Ligatus GmbH
Hohenstaufenring 30-32
D-50674 Köln

Tel.: +49 (0) 221 / 56939 -784
Fax: +49 (0) 221 / 56 939 - 599
E-Mail: [email protected]
Web: www.ligatus.de

HRB Köln 56003
Geschäftsführung:
Dipl.-Kaufmann Lars Hasselbach, Dipl.-Kaufmann Klaus Ludemann,
Dipl.-Wirtschaftsingenieur Arne Wolter
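
PS: Just to check that I understood the per-partition stores correctly, this
is roughly how I picture a task using samza-kv (only a sketch based on my
reading of the docs; the class name and the store name "my-counts" are made
up):

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    public class CountingTask implements StreamTask, InitableTask {
      private KeyValueStore<String, Integer> store;

      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        // each partition gets its own store instance, restored by Samza on failover
        store = (KeyValueStore<String, Integer>) context.getStore("my-counts");
      }

      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        // count messages per key in the partition-local store
        String key = (String) envelope.getKey();
        Integer current = store.get(key);
        store.put(key, current == null ? 1 : current + 1);
      }
    }

Is that the intended usage, i.e. one such store per partition that Samza
moves and restores together with its partition?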
