Apart from your data locality problem, it sounds like what you want is a
work queue. Kafka's consumer model doesn't lend itself well to that use
case: a single partition of a topic should only have one consumer
instance per logical subscriber of the topic, and that consumer can only
mark jobs as completed in strict offset order (while maintaining an
at-least-once processing guarantee). This is not to say it cannot be
done, but I believe a work queue built with Kafka would end up behaving
a bit strangely.
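
To illustrate the ordering constraint, here is a minimal consumer sketch
using the Java client (the "jobs" topic, the "job-workers" group and the
process() helper are made up). Progress is tracked as a single committed
offset per partition, so acknowledging a message implicitly acknowledges
everything before it:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class JobConsumer {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "job-workers");      // one consumer per partition within a group
            props.put("enable.auto.commit", "false");  // commit only after processing
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("jobs"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record.value());  // must succeed before the offset moves forward
                    }
                    // This marks *everything* up to the last polled offset as done;
                    // there is no way to acknowledge a single job out of order.
                    consumer.commitSync();
                }
            }
        }

        private static void process(String job) { /* worker logic goes here */ }
    }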

Christian

On 10/09/2014 06:13 AM, William Briggs wrote:
> Manually managing data locality will become difficult to scale. Kafka is
> one potential tool you can use to help scale, but by itself, it will not
> solve your problem. If you need the data in near-real time, you could use a
> technology like Spark or Storm to stream data from Kafka and perform your
> processing. If you can batch the data, you might be better off pulling it
> into a distributed filesystem like HDFS, and using MapReduce, Spark or
> another scalable processing framework to handle your transformations. Once
> you've paid the initial price for moving the documents into HDFS, your
> network traffic should be fairly manageable; most clusters built for this
> purpose will schedule work to be run local to the data, and typically have
> separate, high-speed network interfaces and a dedicated switch in order to
> optimize intra-cluster communications when moving data is unavoidable.
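
To make the near-real-time option concrete, a rough sketch of pulling the
documents out of Kafka with Spark Streaming could look like the following.
The class names come from the spark-streaming-kafka-0-10 integration; the
broker address, the "documents" topic, the group id and the process()
helper are invented, so treat this as a starting point rather than a
recipe:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    public class DocumentStream {

        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("document-processing");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "broker1:9092");
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "doc-processors");

            // One stream over the (hypothetical) "documents" topic; Spark schedules
            // the per-partition work across the cluster for you.
            JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                    jssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(
                        Arrays.asList("documents"), kafkaParams));

            // Keyword extraction, language detection, etc. would go here.
            stream.foreachRDD(rdd -> rdd.foreach(record -> process(record.value())));

            jssc.start();
            jssc.awaitTermination();
        }

        private static void process(String document) { /* per-document processing */ }
    }
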
> 
> -Will
> 
> On Thu, Oct 9, 2014 at 7:57 AM, Albert Vila <albert.v...@augure.com> wrote:
> 
>> Hi
>>
>> I just came across Kafka while trying to find solutions to scale our
>> current architecture.
>>
>> We are currently downloading and processing 6M documents per day from
>> online and social media. We have a different workflow for each type of
>> document, but some of the steps are keyword extraction, language
>> detection, clustering, classification, indexing, etc. We are using
>> Gearman to dispatch jobs to workers, and we keep some queues in a
>> database.
>>
>> I'm wondering whether we could integrate Kafka into the current workflow,
>> and whether it would be feasible. One of our main discussions is whether
>> to go to a fully distributed architecture or to a semi-distributed one;
>> that is, distribute everything, or process some steps on the same machine
>> (crawling, keyword extraction, language detection, indexing). We don't
>> know which one scales better; each has its pros and cons.
>>
>> Right now we have a semi-distributed setup, because we ran into network
>> problems given the amount of data we were moving around. So all documents
>> crawled on server X are later dispatched through Gearman to workers on
>> that same server. What we dispatch through Gearman is only the document
>> id; the document data stays on the crawling server in Memcached, so
>> network traffic is kept to a minimum.
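
If I read the current setup right, the worker side of that locality trick
is essentially the sketch below; the spymemcached client, the "doc:" key
scheme and the process() helper are my assumptions, just to make the data
flow explicit:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import net.spy.memcached.MemcachedClient;

    public class LocalWorker {

        public static void main(String[] args) throws IOException {
            // The Gearman job payload carries only the document id (faked here as
            // args[0]); the document body never crosses the network because it sits
            // in the Memcached instance on the same crawling server.
            String documentId = args[0];

            MemcachedClient cache =
                new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));
            String document = (String) cache.get("doc:" + documentId);  // assumed key scheme

            // Keyword extraction, language detection, indexing ... all local to this box.
            process(document);
            cache.shutdown();
        }

        private static void process(String document) { /* workflow steps */ }
    }
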
>>
>> What do you think?
>> Is it feasible to remove all the database queues and Gearman and move to
>> Kafka?
>> Since Kafka is mainly message-based, I assume we would have to move the
>> messages around; should we take the network into account? Might we face
>> the same problems?
>> If so, is there a way to isolate some steps so they are processed on the
>> same machine, to avoid network traffic?
>>
>> Any help or comment will be appreciated. And if someone has faced a
>> similar problem and has insight into the right architecture approach,
>> that would be more than welcome.
>>
>> Thanks
>>
> 
