Hi

We process data in real time, and we are taking a look at Storm and Spark
streaming too, however our actions are atomic, done at a document level so
I don't know if it fits on something like Storm/Spark.

Regarding what you Christian said, isn't Kafka used for scenarios like the
one I described? I mean, we do have work queues right now with Gearman, but
with a bunch of workers on each step. I thought we could change that to a
producer and a bunch of consumers (where the message should only reach one
and exact one consumer).

And what I said about the data locally, it was only an optimization we did
some time ago because we was moving more data back then. Maybe now its not
necessary and we could move messages around the system using Kafka, so it
will allow us to simplify the architecture a little bit. I've seen people
saying they move Tb of data every day using Kafka.

Just to be clear on the size of each document/message, we are talking about
tweets, blog posts, ... (on 90% of cases the size is less than 50Kb)

Regards

On 9 October 2014 20:02, Christian Csar <cac...@gmail.com> wrote:

> Apart from your data locality problem it sounds like what you want is a
> workqueue. Kafka's consumer structure doesn't lend itself too well to
> that use case as a single partition of a topic should only have one
> consumer instance per logical subscriber of the topic, and that consumer
> would not be able to mark jobs as completed except in a strict order
> (while maintaining a processed successfully at least once guarantee).
> This is not to say it cannot be done, but I believe your workqueue would
> end up working a bit strangely if built with Kafka.
>
> Christian
>
> On 10/09/2014 06:13 AM, William Briggs wrote:
> > Manually managing data locality will become difficult to scale. Kafka is
> > one potential tool you can use to help scale, but by itself, it will not
> > solve your problem. If you need the data in near-real time, you could
> use a
> > technology like Spark or Storm to stream data from Kafka and perform your
> > processing. If you can batch the data, you might be better off pulling it
> > into a distributed filesystem like HDFS, and using MapReduce, Spark or
> > another scalable processing framework to handle your transformations.
> Once
> > you've paid the initial price for moving the document into HDFS, your
> > network traffic should be fairly manageable; most clusters built for this
> > purpose will schedule work to be run local to the data, and typically
> have
> > separate, high-speed network interfaces and a dedicated switch in order
> to
> > optimize intra-cluster communications when moving data is unavoidable.
> >
> > -Will
> >
> > On Thu, Oct 9, 2014 at 7:57 AM, Albert Vila <albert.v...@augure.com>
> wrote:
> >
> >> Hi
> >>
> >> I just came across Kafta when I was trying to find solutions to scale
> our
> >> current architecture.
> >>
> >> We are currently downloading and processing 6M documents per day from
> >> online and social media. We have a different workflow for each type of
> >> document, but some of the steps are keyword extraction, language
> detection,
> >> clustering, classification, indexation, .... We are using Gearman to
> >> dispatch the job to workers and we have some queues on a database.
> >>
> >> I'm wondering if we could integrate Kafka on the current workflow and if
> >> it's feasible. One of our main discussions are if we have to go to a
> fully
> >> distributed architecture or to a semi-distributed one. I mean,
> distribute
> >> everything or process some steps on the same machine (crawling, keyword
> >> extraction, language detection, indexation). We don't know which one
> scales
> >> more, each one has pros and cont.
> >>
> >> Now we have a semi-distributed one as we had network problems taking
> into
> >> account the amount of data we were moving around. So now, all documents
> >> crawled on server X, later on are dispatched through Gearman to the same
> >> server. What we dispatch on Gearman is only the document id, and the
> >> document data remains on the crawling server on a Memcached, so the
> network
> >> traffic is keep at minimum.
> >>
> >> What do you think?
> >> It's feasible to remove all database queues and Gearman and move to
> Kafka?
> >> As Kafka is mainly based on messages I think we should move the messages
> >> around, should we take into account the network? We may face the same
> >> problems?
> >> If so, there is a way to isolate some steps to be processed on the same
> >> machine, to avoid network traffic?
> >>
> >> Any help or comment will be appreciate. And If someone has had a similar
> >> problem and has knowledge about the architecture approach will be more
> than
> >> welcomed.
> >>
> >> Thanks
> >>
> >
>
>


-- 
*Albert Vila*
R&D Manager & Software Developer


Tél. : +34 972 982 968

*www.augure.com* <http://www.augure.com/> | *Blog.* Reputation in action
<http://blog.augure.es/> | *Twitter. *@AugureSpain
<https://twitter.com/AugureSpain>
*Skype *: albert.vila | *Access map.* Augure Girona
<https://maps.google.com/maps?q=Eiximenis+12,+17001+Girona,+Espanya&hl=ca&sll=50.956548,6.799948&sspn=30.199963,86.044922&hnear=Carrer+Eiximenis,+12,+17001+Girona,+Espanya&t=m&z=16>

Reply via email to