Re: Review Request 17603: SAMZA-136 Editing documentation (introduction and comparisons sections)

Martin Kleppmann Sat, 08 Feb 2014 16:58:07 -0800


> On Feb. 7, 2014, 6:46 p.m., Chris Riccomini wrote:
> > docs/learn/documentation/0.7.0/comparisons/storm.md, line 10
> > <https://reviews.apache.org/r/17603/diff/2/?file=471143#file471143line10>
> >
> >     I think a spout is actually similar to a consumer (SystemConsumer) in 
> > Samza's parlance.
> >     
> >     In Storm, a spout is a thing that feeds messages from a stream into 
> > Storm's topologies. This is what a SystemConsumer does with Samza.

Good point, I'll update it to "spouts in Storm are similar to stream consumers 
in Samza".

> On Feb. 7, 2014, 6:46 p.m., Chris Riccomini wrote:
> > docs/learn/documentation/0.7.0/comparisons/storm.md, line 18
> > <https://reviews.apache.org/r/17603/diff/2/?file=471143#file471143line18>
> >
> >     Even Storm's "exactly once" messaging is somewhat misleading.
> >     
> >     First, Storm only guarantees exactly once messaging within its 
> > framework? That is, if a Kafka producer sends a message, then times out 
> > (but the message makes it to the broker before the timeout), and re-sends, 
> > Storm's spout will process both messages (duplicates). This isn't really 
> > Storm's fault, but the point is that you get duplicate messages processed 
> > by your bolts.
> >     
> >     Second, what happens in the "exactly once" case in cases where the bolt 
> > is mutating state while processing a batch, and a failure occurs? As far as 
> > I know, Storm's state management requires idempotent operations, and only 
> > occurs outside of the topology, right?
> >     
> >     It might be worth discussing this, as these are both things that Samza 
> > and Kafka are attempting to address.

Re point 1, yes: messages are actually processed at least once by bolts, but 
the side-effects of the processing on state (e.g. counters) are idempotent when 
retried, so that the value of the state looks as though messages were processed 
exactly once.

I'm adding a note to make clear that exactly-once in Storm does not apply to 
external side-effects, such as sending messages to a broker.

Re point 2, I think the state implementation only requires atomic writes for a 
single key (i.e. in a K-V store, either the value for a key is updated or not, 
but you can't get old and new value spliced together). The idempotence is 
ensured by metadata in the value.
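
To illustrate what "metadata in the value" could look like, here is a minimal sketch in Python. It is not Storm's actual state API: the class name, the `batch_id` field, and the assumption that batches are applied in increasing ID order are all hypothetical, chosen only to show how a retried batch can become a no-op.

```python
class IdempotentCounterStore:
    """Toy in-memory K-V store (hypothetical, not Storm's API).

    Each value is stored together with the ID of the last batch applied to it.
    Assumes batches are applied in strictly increasing batch_id order.
    """

    def __init__(self):
        self._data = {}  # key -> (count, last_batch_id)

    def apply(self, key, batch_id, increment):
        count, last_batch_id = self._data.get(key, (0, None))
        if last_batch_id is not None and batch_id <= last_batch_id:
            # This batch was already applied; a retry must not count twice.
            return count
        # Value and metadata are written together as one update for the key.
        self._data[key] = (count + increment, batch_id)
        return count + increment

    def get(self, key):
        return self._data.get(key, (0, None))[0]


store = IdempotentCounterStore()
store.apply("page_views", batch_id=1, increment=5)
store.apply("page_views", batch_id=2, increment=3)
store.apply("page_views", batch_id=2, increment=3)  # replay after a failure
```

The replayed batch is processed a second time, but the stored count looks as though it had been processed exactly once.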

When the idempotent producer is released, we'll have to update this doc again.

> On Feb. 7, 2014, 6:46 p.m., Chris Riccomini wrote:
> > docs/learn/documentation/0.7.0/comparisons/storm.md, line 40
> > <https://reviews.apache.org/r/17603/diff/2/?file=471143#file471143line40>
> >
> >     This is somewhat confusing. Samza does not hold a single job per 
> > process. You can have N processes (SamzaContainers) for a single job. This 
> > is configured with YARN jobs using yarn.container.count.
> >     
> >     Might be worth calling out that a single Storm process with 100 threads 
> > is equivalent to a Samza job with 100 containers.
> >     
> >

Yes, agree this was confusing. I thought it would be better to restructure this 
paragraph. Here's the new version -- is it better?

Storm's [parallelism model](https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology)
is fairly similar to Samza's. Both frameworks split processing into 
independent *tasks* that can run in parallel. Resource allocation is 
independent of the number of tasks: a small job can keep all tasks in a single 
process on a single machine; a large job can spread the tasks over many 
processes on many machines.

The biggest difference is that Storm uses one thread per task by default, 
whereas Samza uses single-threaded processes (containers). A Samza container 
may contain multiple tasks, but there is only one thread that invokes each of 
the tasks in turn. This means each container is mapped to exactly one CPU core, 
which makes the resource model much simpler and reduces interference from other 
tasks running on the same machine. Storm's multithreaded model can make 
better use of excess capacity on an idle machine, at the cost of a less 
predictable resource model.
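
As a rough illustration of the single-threaded container described above, the following Python sketch shows one loop invoking several tasks in turn. The `Container` and `RecordingTask` classes and the partition-to-task mapping are hypothetical names for this example, not Samza's actual classes.

```python
from collections import deque


class RecordingTask:
    """Hypothetical task that just records the messages it is given."""

    def __init__(self, name):
        self.name = name
        self.processed = []

    def process(self, message):
        self.processed.append(message)


class Container:
    """Hypothetical single-threaded container holding several tasks."""

    def __init__(self, tasks):
        self.tasks = tasks  # partition number -> task

    def run(self, inbox):
        # One thread, one loop: each message is handed to the task that owns
        # its partition, so tasks never run concurrently with each other.
        while inbox:
            partition, message = inbox.popleft()
            self.tasks[partition].process(message)


tasks = {0: RecordingTask("task-0"), 1: RecordingTask("task-1")}
container = Container(tasks)
container.run(deque([(0, "a"), (1, "b"), (0, "c")]))
```

Because only one task runs at any moment, the container never needs more than one core, which is what keeps the resource model simple.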

> On Feb. 7, 2014, 6:46 p.m., Chris Riccomini wrote:
> > docs/learn/documentation/0.7.0/comparisons/storm.md, line 68
> > <https://reviews.apache.org/r/17603/diff/2/?file=471143#file471143line68>
> >
> >     You might want to call this out in the exactly once discussion above. 
> > If you have two topologies communicating with each other, they need to send 
> > messages through an underlying system (Kafka, HDFS, Kestrel, etc). This 
> > will break exactly-once messaging.

Done.

> On Feb. 7, 2014, 6:46 p.m., Chris Riccomini wrote:
> > docs/learn/documentation/0.7.0/comparisons/storm.md, line 96
> > <https://reviews.apache.org/r/17603/diff/2/?file=471143#file471143line96>
> >
> >     Can't this be done in Samza by running a web service in a container, 
> > using streams to pass messages, and then having the web service container 
> > block until it receives a response message?

I think it could be done in Samza, but you'd have to do all the message routing 
yourself, making sure that the response is matched back to the request that 
generated it. Storm provides that as a built-in feature. (Storm's low-latency 
ZeroMQ communication perhaps also helps make this more practical, compared to 
several hops through Kafka?)
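
To make the "message routing yourself" part concrete, here is a minimal Python sketch of matching responses back to the requests that generated them. Everything in it is an assumption for illustration: the `ResponseMatcher` class, the `request_id` field, and the plain list standing in for a request stream. It is neither Samza's nor Storm's API.

```python
import itertools


class ResponseMatcher:
    """Hypothetical helper that pairs responses with pending requests."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}  # request_id -> callback awaiting the response

    def send_request(self, payload, output_stream, on_response):
        request_id = next(self._ids)
        self._pending[request_id] = on_response
        # The ID travels with the message so the reply can carry it back.
        output_stream.append({"request_id": request_id, "payload": payload})
        return request_id

    def handle_response(self, message):
        callback = self._pending.pop(message["request_id"], None)
        if callback is None:
            return False  # unknown or duplicate response: ignore it
        callback(message["result"])
        return True


requests = []  # stands in for the outgoing request stream
results = {}
matcher = ResponseMatcher()
first = matcher.send_request("alice", requests, lambda r: results.update(alice=r))
second = matcher.send_request("bob", requests, lambda r: results.update(bob=r))

# Responses may come back in a different order than the requests went out.
matcher.handle_response({"request_id": second, "result": 2})
matcher.handle_response({"request_id": first, "result": 1})
```

A real implementation would also need timeouts for requests that never get a response, which this sketch leaves out.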

Personally I don't think this is a very important feature -- I reckon it's 
useful only for very specialised use cases, and Nathan happened to have such a 
use case for Twitter analytics. But I thought it was worth mentioning, as the 
Storm docs talk about it prominently, and it's kinda clever.

I'll add a note saying you can build DRPC yourself if you want it.

> On Feb. 7, 2014, 6:46 p.m., Chris Riccomini wrote:
> > docs/learn/documentation/0.7.0/introduction/background.md, line 34
> > <https://reviews.apache.org/r/17603/diff/2/?file=471147#file471147line34>
> >
> >     Can you make these changes to Samza's index (landing) page as well? 
> > These two descriptions are identical, and should ideally be kept in sync.

Ok, I've brought the landing page and this page back in sync.

- Martin

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17603/#review33936
-----------------------------------------------------------

On Feb. 6, 2014, 10:58 p.m., Martin Kleppmann wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/17603/
> -----------------------------------------------------------
> 
> (Updated Feb. 6, 2014, 10:58 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> Copy-edited the 'introduction' and 'comparisons' sections of the 
> documentation, to make them more fluid to read.
> 
> Changed all uses of the word 'member' (which is quite LinkedIn-specific 
> terminology) to refer to 'user' instead.
> 
> Rewrote the explanation of state management (in comparisons/introduction) as 
> I found it confusing.
> 
> Rewrote the page comparing Samza with Storm, because it was outdated and no 
> longer represented Storm accurately.
> 
> 
> Diffs
> -----
> 
>   docs/img/0.7.0/learn/documentation/introduction/dag.png 
> bda85b2244df5f65f5472d557900fa2a65ea55c9 
>   docs/img/0.7.0/learn/documentation/introduction/group-by-example.png 
> 1acd355c4565ee484540897c9c1712ae0c03d185 
>   docs/learn/documentation/0.7.0/api/overview.md 
> b2324a411e8929c03971fd64a94699e8f6ded809 
>   docs/learn/documentation/0.7.0/comparisons/introduction.md 
> b70697ba51604b6d6b1c49e4e8ff0376d5d92ec1 
>   docs/learn/documentation/0.7.0/comparisons/mupd8.md 
> bb0d5a11691ae80725e51b799ab56d65edcb36db 
>   docs/learn/documentation/0.7.0/comparisons/storm.md 
> b87c2077db2527041d8ed0397e2720772862dc60 
>   docs/learn/documentation/0.7.0/container/task-runner.md 
> 27dab79f76a34385db5e6bebec42dd0964cbb878 
>   docs/learn/documentation/0.7.0/container/windowing.md 
> 6058707e7d51986e8e36770303835673956a50b6 
>   docs/learn/documentation/0.7.0/introduction/architecture.md 
> ff8357dd0397156aebdc9fa30964b18c7a71c376 
>   docs/learn/documentation/0.7.0/introduction/background.md 
> 52d8e41cccbeb5851578c95dd0edca24f2b8471f 
>   docs/learn/documentation/0.7.0/introduction/concepts.md 
> 2736bf0985c78d0314ed2011dc768cbbc5453f49 
> 
> Diff: https://reviews.apache.org/r/17603/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Martin Kleppmann
> 
>
