Hi, thanks for the detailed answers. One more question, though: what do you mean by "over-partition"? Do you mean I would initially define, let's say, 100 partitions and then assign only 2 containers? When I need to scale out, I would just add more containers and Samza would then also redistribute the state stores? And what if I want to reduce the number of containers (e.g. when hosting on EC2), can Samza merge the state stores from several containers into one?
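
Just to make sure we are talking about the same thing, this is roughly the setup I have in mind (only a sketch; the topic name is made up, I took task.inputs from the docs, and yarn.container.count is the key you mentioned):

    # Kafka input topic created up front with 100 partitions
    task.inputs=kafka.my-input-topic
    # start with only 2 containers and raise this later when scaling out
    yarn.container.count=2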
Cheers,

Klaus


On Mon, Dec 9, 2013 at 6:40 PM, Chris Riccomini <[email protected]> wrote:

> Hi Klaus,
>
> Thanks for your interest. I'm happy to answer any questions you've got.
>
> 1) In case of failure, will Samza restore the state automatically on
> another node?
>
> Yes. If you are using YARN
> (job.factory.class=org.apache.samza.job.yarn.YarnJobFactory), then YARN
> will see that a Samza container has failed. It will restart the Samza
> container on a machine in the YARN grid. This might be the same node on
> the grid, or it might be a new node; YARN makes that decision based on
> the available resources in the grid. When the Samza container starts up,
> Samza will restore the container's state to where it was before the
> failure occurred. Once the state has been fully restored, the Samza
> container will start feeding your StreamTask new messages from the input
> streams.
>
> 2) If I want to scale out and increase the number of stream partitions,
> how is the local storage handled? Is it distributed by the framework as
> well, according to the partitions?
>
> Samza's partitioning model is currently determined by the number of
> partitions that a Samza job's input streams have. A Samza job will always
> have the max of the partition counts of all of its input streams. For
> example, if a Samza job has two input streams, A and B, and stream A has
> 4 partitions and stream B has 8, then the Samza job will have 8
> partitions. The first four partitions (0-3) of the Samza job will receive
> messages from both stream A and stream B, and the last four partitions
> (4-7) will receive messages ONLY from stream B. These partitions are run
> physically inside of Samza "containers". Samza containers are assigned
> the partitions that they are responsible for processing. For example, if
> you had two Samza containers, the first container would process 4
> partitions, and the second container would process the other four. With
> YARN, the number of containers you have is defined by the config
> yarn.container.count. Right now, you can never have more containers than
> input stream partitions.
>
> The state for each one of a Samza job's partitions is managed entirely
> independently. That is, each Samza partition has its own state store
> (LevelDB, if you're using samza-kv). So in the example above, if the
> StreamTask were using a single key-value store, there would be 8 LevelDB
> stores, one for each Samza partition. This allows us to move partitions
> between containers. If you were to decide you wanted 3 containers instead
> of 2, Samza would simply stop processing the partitions on the existing
> two containers, restore the state needed by the third container, and then
> begin processing the partitions across all three containers. This model
> means the maximum parallelism you can get is one container per partition,
> i.e. up to as many containers as your input streams have partitions.
>
> Samza does not support resizing an input stream's partition count right
> now. Once you start a Samza job, the partition counts for the input
> streams are assumed to be static. If you decide you need more
> parallelism, you need to start a new Samza job with a different job.name
> and re-process all the input data again.
>
> Cheers,
> Chris
>
> On 12/9/13 8:02 AM, "Klaus Schaefers" <[email protected]> wrote:
>
> >Hi,
> >
> >I have been reading about Samza and I like the concept behind it a lot.
> >In particular the local key-value store is a good idea.
> >However, I have some short questions regarding the local state that I
> >couldn't answer while reading the web page. I would be very happy if
> >someone could answer them shortly. Here they are:
> >
> >1) In case of failure, will Samza restore the state automatically on
> >another node?
> >
> >2) If I want to scale out and increase the number of stream partitions,
> >how is the local storage handled? Is it distributed by the framework as
> >well, according to the partitions?
> >
> >Cheers,
> >
> >Klaus

--

Klaus Schaefers
Senior Optimization Manager

Ligatus GmbH
Hohenstaufenring 30-32
D-50674 Köln

Tel.: +49 (0) 221 / 56939 -784
Fax: +49 (0) 221 / 56 939 - 599
E-Mail: [email protected]
Web: www.ligatus.de

HRB Köln 56003
Geschäftsführung:
Dipl.-Kaufmann Lars Hasselbach, Dipl.-Kaufmann Klaus Ludemann,
Dipl.-Wirtschaftsingenieur Arne Wolter
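
PS: Just to check that I understood the per-partition stores correctly, this
is roughly how I picture a task using samza-kv (only a sketch based on my
reading of the docs; the class name and the store name "my-counts" are made
up):

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    public class CountingTask implements StreamTask, InitableTask {
      private KeyValueStore<String, Integer> store;

      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        // each partition gets its own store instance, restored by Samza on failover
        store = (KeyValueStore<String, Integer>) context.getStore("my-counts");
      }

      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        // count messages per key in the partition-local store
        String key = (String) envelope.getKey();
        Integer current = store.get(key);
        store.put(key, current == null ? 1 : current + 1);
      }
    }

Is that the intended usage, i.e. one such store per partition that Samza
moves and restores together with its partition?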
