Ah, got it.
To actually answer your question: Replica assignment tool allows you
to assign partitions to specific brokers. Since you always read from
the lead replica, you can mark specific replicas as "preferred
replica" and they will be the leader if they are available.
You'll need to get each
Let's say I have 3 different types of algorithms that are implemented by
three streaming apps (on Yarn).
They are completely separate, meaning that they can run in parallel on the
same data and not sequentially.
*Using Kafka: *File X is loaded into HDFS and I want Algorithms A and B to
process it
Probably off-topic for Kafka list, but why do you think you need
multiple copies of the file to parallelize access?
You'll have parallel access based on how many containers you have on
the machine (if you are using YARN-Spark).
On Mon, Mar 16, 2015 at 1:20 PM, Daniel Haviv
wrote:
> Hi,
> The reas
Hi,
The reason we want to use this method is that this way a file can be consumed
by different streaming apps simultaneously (they just consume it's path from
kafka and open it locally).
With fileStream to parallelize the processing of a specific file I will have to
make several copies of it,
Any reason not to use SparkStreaming directly with HDFS files, so
you'll get locality guarantees from the Hadoop framework?
StreamContext has textFileStream() method you could use for this.
On Mon, Mar 16, 2015 at 12:46 PM, Daniel Haviv
wrote:
> Hi,
> Is it possible to assign specific partitions
Hi,
Is it possible to assign specific partitions to specific nodes?
I want to upload files to HDFS, find out on which nodes the file resides
and then push their path into a topic and partition it by nodes.
This way I can ensure that the consumer (Spark Streaming) will consume both
the message and f