Manually managing data locality will become difficult as you scale. Kafka is
one potential tool you can use to help, but by itself it will not solve your
problem. If you need the data in near real time, you could use a technology
like Spark or Storm to stream data from Kafka and perform your processing.
If you can batch the data, you might be better off pulling it into a
distributed filesystem like HDFS and using MapReduce, Spark, or another
scalable processing framework to handle your transformations. Once you've
paid the initial price for moving the documents into HDFS, your network
traffic should be fairly manageable; most clusters built for this purpose
will schedule work to run local to the data, and typically have separate,
high-speed network interfaces and a dedicated switch to optimize
intra-cluster communication when moving data is unavoidable.
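
For the streaming path, here is a rough sketch of what a Spark Streaming job
consuming from Kafka could look like. The topic name, ZooKeeper address,
consumer group, and the extractKeywords step are all placeholders for your
actual setup:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DocumentStream {
      // Placeholder for one of your real workflow steps
      // (keyword extraction, language detection, ...)
      def extractKeywords(doc: String): String = doc

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("DocumentStream")
        // Process incoming documents in 10-second micro-batches
        val ssc = new StreamingContext(conf, Seconds(10))

        // Read the (hypothetical) "documents" topic with 4 receiver threads;
        // messages arrive as (key, value) pairs
        val stream = KafkaUtils.createStream(ssc, "zkhost:2181",
          "doc-processors", Map("documents" -> 4))

        stream.map { case (_, doc) => extractKeywords(doc) }
          .print() // stand-in for writing to your index or the next stage

        ssc.start()
        ssc.awaitTermination()
      }
    }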
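The batch path looks much the same once the documents are sitting in HDFS;
again, the paths and the processing step are only illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object DocumentBatch {
      // Same placeholder step as above
      def extractKeywords(doc: String): String = doc

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("DocumentBatch"))

        // Spark schedules each task on a node that already holds the
        // corresponding HDFS block, so documents rarely cross the network
        sc.textFile("hdfs:///data/documents/2014-10-09/")
          .map(extractKeywords)
          .saveAsTextFile("hdfs:///data/keywords/2014-10-09/")
      }
    }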

-Will

On Thu, Oct 9, 2014 at 7:57 AM, Albert Vila <albert.v...@augure.com> wrote:

> Hi
>
> I just came across Kafka when I was trying to find solutions to scale our
> current architecture.
>
> We are currently downloading and processing 6M documents per day from
> online and social media. We have a different workflow for each type of
> document, but some of the steps are keyword extraction, language detection,
> clustering, classification, indexation, .... We are using Gearman to
> dispatch jobs to workers, and we have some queues in a database.
>
> I'm wondering whether we could integrate Kafka into the current workflow,
> and if it's feasible. One of our main discussions is whether we should go
> to a fully distributed architecture or a semi-distributed one. I mean,
> distribute everything or process some steps on the same machine (crawling,
> keyword extraction, language detection, indexation). We don't know which
> one scales better; each one has pros and cons.
>
> Right now we have a semi-distributed one, as we had network problems given
> the amount of data we were moving around. So currently, all documents
> crawled on server X are later dispatched through Gearman to that same
> server. What we dispatch through Gearman is only the document id, and the
> document data remains on the crawling server in Memcached, so the network
> traffic is kept to a minimum.
>
> What do you think?
> Is it feasible to remove all the database queues and Gearman and move to
> Kafka? As Kafka is mainly based on messages, I think we would have to move
> the messages around; should we take the network into account? Might we
> face the same problems?
> If so, is there a way to isolate some steps to be processed on the same
> machine, to avoid network traffic?
>
> Any help or comment will be appreciated. And if someone has had a similar
> problem, any knowledge about the architectural approach would be more than
> welcome.
>
> Thanks
>
