Re: Assigning partitions to specific nodes

2015-03-16 Thread Gwen Shapira
Ah, got it. To actually answer your question: the replica assignment tool allows you to assign partitions to specific brokers. Since you always read from the leader replica, you can mark specific replicas as the "preferred replica" and they will be the leader if they are available. You'll need to get each
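A minimal sketch of what such an assignment could look like, assuming the JSON layout accepted by Kafka's partition reassignment tool (`kafka-reassign-partitions.sh --reassignment-json-file`), where the first broker listed in `replicas` is the preferred replica. The topic name `file-paths`, the broker IDs, and the helper `build_reassignment` are hypothetical and only for illustration:

```python
import json


def build_reassignment(topic, preferred_leaders, brokers):
    """Build a reassignment plan that puts each partition's preferred
    replica (the first entry in "replicas") on a chosen broker.

    preferred_leaders: dict of partition number -> broker id (hypothetical values)
    brokers: all broker ids that should hold a replica of each partition
    """
    partitions = []
    for partition, leader in sorted(preferred_leaders.items()):
        # The remaining brokers become followers, keeping their given order.
        followers = [b for b in brokers if b != leader]
        partitions.append({
            "topic": topic,
            "partition": partition,
            "replicas": [leader] + followers,
        })
    return {"version": 1, "partitions": partitions}


# Hypothetical cluster: three brokers, one partition led by each.
plan = build_reassignment("file-paths", {0: 1, 1: 2, 2: 3}, [1, 2, 3])
print(json.dumps(plan, indent=2))
```

The plan only records which replica is preferred. A preferred replica election (or automatic leader rebalancing, if enabled) is still needed before leadership actually moves, and leadership moves again if that broker becomes unavailable.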

Re: Assigning partitions to specific nodes

2015-03-16 Thread Daniel Haviv
Let's say I have 3 different types of algorithms that are implemented by three streaming apps (on YARN). They are completely separate, meaning that they can run in parallel on the same data and not sequentially. *Using Kafka:* File X is loaded into HDFS and I want Algorithms A and B to process it

Re: Assigning partitions to specific nodes

2015-03-16 Thread Gwen Shapira
Probably off-topic for Kafka list, but why do you think you need multiple copies of the file to parallelize access? You'll have parallel access based on how many containers you have on the machine (if you are using YARN-Spark).
On Mon, Mar 16, 2015 at 1:20 PM, Daniel Haviv wrote:
> Hi,
> The reas

Re: Assigning partitions to specific nodes

2015-03-16 Thread Daniel Haviv
Hi, The reason we want to use this method is that this way a file can be consumed by different streaming apps simultaneously (they just consume its path from Kafka and open it locally). With fileStream, to parallelize the processing of a specific file I will have to make several copies of it,

Re: Assigning partitions to specific nodes

2015-03-16 Thread Gwen Shapira
Any reason not to use Spark Streaming directly with HDFS files, so you'll get locality guarantees from the Hadoop framework? StreamingContext has a textFileStream() method you could use for this.
On Mon, Mar 16, 2015 at 12:46 PM, Daniel Haviv wrote:
> Hi,
> Is it possible to assign specific partitions

Assigning partitions to specific nodes

2015-03-16 Thread Daniel Haviv
Hi, Is it possible to assign specific partitions to specific nodes? I want to upload files to HDFS, find out on which nodes each file resides, and then push its path into a topic partitioned by node. This way I can ensure that the consumer (Spark Streaming) will consume both the message and f
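The producer-side routing described here could be sketched roughly as follows. This is an illustration under assumptions, not a tested design: the host names, the `host_to_partition` mapping, and the helper `partition_for_file` are all hypothetical, and the list of block hosts is assumed to have been looked up from HDFS beforehand.

```python
from collections import Counter


def partition_for_file(block_hosts, host_to_partition):
    """Choose a topic partition for a file path message.

    block_hosts: hosts holding the file's blocks, one entry per block replica
    host_to_partition: hypothetical mapping of host name -> partition number
    Returns the partition mapped to the known host that holds the most blocks.
    """
    counts = Counter(h for h in block_hosts if h in host_to_partition)
    if not counts:
        raise ValueError("no block host has an assigned partition")
    best_host, _ = counts.most_common(1)[0]
    return host_to_partition[best_host]


# Hypothetical three-node cluster with one partition per node.
host_to_partition = {"node1": 0, "node2": 1, "node3": 2}
hosts = ["node1", "node2", "node2", "node3"]
chosen = partition_for_file(hosts, host_to_partition)
print(chosen)
```

The chosen number would then be passed as the explicit partition when producing the path message. This only decides which partition receives the message; whether that partition's leader, and the consumer reading it, actually run on the same node is a separate matter.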