Thanks guys, this is good. So let's say I configure my Kafka topics to ingest 
data from various streams (in this case we are talking forex tick data). I could 
then partition out and buffer to HDFS (which has a replication factor) based on 
currency pair, e.g. EURUSD? 
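Something like this minimal sketch of key-based partitioning (simplified: Kafka's real default partitioner hashes the key bytes with murmur2, but any deterministic hash gives the same guarantee, i.e. every tick for a given pair lands in the same partition):

```java
import java.util.HashMap;
import java.util.Map;

public class CurrencyPairPartitioner {
    // Simplified stand-in for Kafka's key-based partitioner. Kafka's
    // default uses murmur2 over the key bytes; the point is only that
    // the mapping is deterministic, so all EURUSD ticks go to one
    // partition, and a Samza task consuming that partition sees the
    // full EURUSD stream in order.
    static int partitionFor(String currencyPair, int numPartitions) {
        return Math.abs(currencyPair.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 8;
        Map<String, Integer> assignment = new HashMap<>();
        for (String pair : new String[] {"EURUSD", "GBPUSD", "USDJPY", "EURUSD"}) {
            assignment.put(pair, partitionFor(pair, numPartitions));
        }
        System.out.println(assignment);
    }
}
```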

The next question I have: is it entirely appropriate to continue consuming 
feeds (remember these are live feeds, not pre-generated) without having an 
active Samza job running over the feed at that point in time? This leads me 
back to my AM question. I am going to be consuming data continuously; however, 
as a user I may want to set up and run jobs on the stream as it arrives, in 
the context of all existing data or only a subset of the data (the latter may 
fall back to a standard MapReduce job).
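My working assumption here: since Kafka persists messages on disk for a configurable retention period, producing into a topic doesn't require a consumer to be running at the same time, and a Samza job started later can read from the oldest retained offset. A hedged config sketch (property names assume the Samza `systems.<system>.samza.offset.default` scheme and Kafka broker retention settings; worth checking against the docs for your versions):

```
# Samza job config: read the input from the oldest retained offset when
# the job first starts, so data produced before the job existed is
# still processed.
systems.kafka.samza.offset.default=oldest

# Kafka broker config (server.properties): e.g. keep 7 days of ticks.
log.retention.hours=168
```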

I also want to write my jobs in Cucumber, but that's for another list.

Thoughts.

-------- Original message --------
From: Chris Riccomini <[email protected]> 
Date: 24/04/2014 03:10 (GMT+10:00) 
To: [email protected] 
Subject: Re: Application Master 

Hey Steve,

One thing I'd add is that whereas Map/Reduce partitions tasks by file
split, Samza partitions tasks by input stream partition (i.e. Kafka topic
partition). It's true that a given key maps to just one partition in
Samza, but it's not a 1:1 relationship--multiple keys map to the same
input stream partition, and thus the same task. For example, task1 might
receive messages from partition0 of the input stream, which contains
messages for keys 0, 2, 4, 6, 8, and so on.

Cheers,
Chris

On 4/22/14 10:46 PM, "Zhijie Shen" <[email protected]> wrote:

>AM is the master of a distributed application on YARN. It's supposed to
>negotiate with YARN for the cluster resources and monitor the status of
>the application. It's not tied to MapReduce: MapReduce V2 has its own
>AM, and Samza has one itself as well.
>
>
>On Tue, Apr 22, 2014 at 3:40 PM, Steve Yates
><[email protected]>wrote:
>
>> Guys, is it fair to say that YARN exposes an extension mechanism called
>> the ApplicationMaster, and that by default in YARN this master is a
>> MapReduce application master?
>>
>> In the case of Samza, we have implemented a streaming case of this AM,
>> which takes full advantage of the parallel / fault-tolerant mechanisms
>> built into Hadoop.
>>
>> So instead, where we partition MapReduce tasks based on file splits in
>> HDFS, we split a stream into stream tasks based on some filter key? Is
>> this correct?
>>
>> -S
>
>
>
>
>-- 
>Zhijie Shen
>Hortonworks Inc.
>http://hortonworks.com/
>
