Some comments below.

On 10 October 2014 11:19, cac...@gmail.com <cac...@gmail.com> wrote:

> Albert, you certainly can use Kafka (and it will probably work quite
> well); you'll just need to make sure your consumers are written to match
> the available options. I think I may not have a good picture of what you
> need to do. Is it that you have a stream of documents coming in, and then
> each document can go through crawling, keyword extraction, language
> detection, and indexation in any order, or are the steps ordered?
>

Some steps are ordered and others can be done in parallel. Everything
starts after the crawler, because it outputs the documents. Those documents
then follow a graph of steps where we enrich their data.



> The difference between a workqueue and what Kafka provides is that with a
> workqueue all the documents would be in one big queue, all the consumers
> have equal access to each document, and they can process the documents out
> of order if required. In Kafka, to have multiple consumers on one
> subscription, the data is partitioned, with consumers assigned to
> particular partitions (though this can happen in a coordinated fashion at
> runtime), and each consumer will only chug through the data in its own
> partition, with messages marked as processed in strict order. You can use
> this partitioning to control which consumer will process a message, but
> you may not want to. Does this help at all?


I think it does. In fact, what I had in mind is to have a topic for each
step, for example:

keyword topic [{partition: server1, message: document content}, {...}]
language topic [{partition: server1, message: document content}, {...}]
...

Then some of the consumers will be in the same consumer group (to act as a
distributed queue) and others will be in different consumer groups (to act
as pub/sub when we can run some steps in parallel).
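
For the consumers, something like this sketch is what I have in mind
(assuming the Kafka Java client; the topic, group and broker names here are
made up):

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class KeywordStepWorker {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            // All workers of this step share one group.id, so the topic
            // behaves like a distributed queue: each document is delivered
            // to exactly one worker in the group. Another step (say, the
            // language detector) would subscribe with its own group.id and
            // receive the same documents again, giving pub/sub behaviour.
            props.put("group.id", "keyword-extractors");
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Arrays.asList("keyword"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(1000);
                    for (ConsumerRecord<String, String> record : records) {
                        extractKeywords(record.value()); // this step's logic
                    }
                }
            }
        }

        private static void extractKeywords(String document) { /* ... */ }
    }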

Is that correct, or am I missing something about how Kafka works? What I
see is that this way the document content gets duplicated across different
topics, but maybe that's not a problem for Kafka (document size is around
50 KB). Is it OK to partition the data by crawling server (right now 14
servers), or should we use some other field to get a better distribution?
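
To make the partitioning question concrete, the producer side could look
roughly like this (again just a sketch; with the default partitioner the
key is hashed to pick a partition, so keying by crawl server means all of
server1's documents land in one partition, and with only 14 distinct keys
at most 14 partitions would ever carry data, while keying by document id
would spread the load evenly):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class CrawlerPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            String documentJson = "{...}"; // the ~50 KB document payload

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // One record per step the document must go through: the
                // topic selects the step, the key selects the partition.
                producer.send(new ProducerRecord<>("keyword", "server1", documentJson));
                producer.send(new ProducerRecord<>("language", "server1", documentJson));
            }
        }
    }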



> Without forethought, using Kafka you wouldn't be able to suddenly
> increase the number of consumers to process a backlog, since you could
> have at most one consumer per partition (but with forethought you could
> start with a larger number of partitions so as to be able to increase the
> number of consumers by having each consumer process fewer partitions at a
> time).
>

I don't understand the last paragraph. Can't I have a consumer group
containing X consumer instances, and can't I later increase the number of
instances?
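
(From what I've read, the limit is that a group can have at most one
active consumer per partition, so the headroom has to be created up front
when the topic is made; assuming the standard topic tool, something like

    bin/kafka-topics.sh --create --zookeeper zk1:2181 \
        --topic keyword --partitions 28 --replication-factor 2

would let us grow each step's group up to 28 instances later on. Is that
the idea?)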



> As for Storm, I don't believe there's anything stopping each bolt from just
> being concerned with a single document at a time, and if your stages are
> sequential then your use of Kafka might well be equivalent to a simple
> linear Storm topology.
>
Exactly, that's why we are evaluating whether Kafka alone is enough. If
Storm gives us the same benefits as Kafka, it's better to stick with only
one technology and keep everything as simple as possible.


> Christian
>

Thanks



>
> On Thu, Oct 9, 2014 at 11:57 PM, Albert Vila <albert.v...@augure.com>
> wrote:
>
> > Hi
> >
> > We process data in real time, and we are taking a look at Storm and
> > Spark streaming too; however, our actions are atomic, done at the
> > document level, so I don't know if they fit something like Storm/Spark.
> >
> > Regarding what you said, Christian, isn't Kafka used for scenarios like
> > the one I described? I mean, we do have work queues right now with
> > Gearman, but with a bunch of workers on each step. I thought we could
> > change that to a producer and a bunch of consumers (where each message
> > should reach one and exactly one consumer).
> >
> > And what I said about keeping the data local: it was only an
> > optimization we did some time ago, because we were moving more data
> > back then. Maybe now it's not necessary and we could move messages
> > around the system using Kafka, which would allow us to simplify the
> > architecture a little bit. I've seen people saying they move TB of data
> > every day using Kafka.
> >
> > Just to be clear on the size of each document/message, we are talking
> > about tweets, blog posts, ... (in 90% of cases the size is less than
> > 50 KB).
> >
> > Regards
> >
> > On 9 October 2014 20:02, Christian Csar <cac...@gmail.com> wrote:
> >
> > > Apart from your data locality problem, it sounds like what you want
> > > is a workqueue. Kafka's consumer structure doesn't lend itself too
> > > well to that use case, as a single partition of a topic should only
> > > have one consumer instance per logical subscriber of the topic, and
> > > that consumer would not be able to mark jobs as completed except in
> > > strict order (while maintaining an at-least-once processing
> > > guarantee). This is not to say it cannot be done, but I believe your
> > > workqueue would end up working a bit strangely if built with Kafka.
> > >
> > > Christian
> > >
> > > On 10/09/2014 06:13 AM, William Briggs wrote:
> > > > Manually managing data locality will become difficult to scale.
> > > > Kafka is one potential tool you can use to help scale, but by
> > > > itself, it will not solve your problem. If you need the data in
> > > > near-real time, you could use a technology like Spark or Storm to
> > > > stream data from Kafka and perform your processing. If you can
> > > > batch the data, you might be better off pulling it into a
> > > > distributed filesystem like HDFS, and using MapReduce, Spark or
> > > > another scalable processing framework to handle your
> > > > transformations. Once you've paid the initial price for moving the
> > > > document into HDFS, your network traffic should be fairly
> > > > manageable; most clusters built for this purpose will schedule work
> > > > to run local to the data, and typically have separate, high-speed
> > > > network interfaces and a dedicated switch in order to optimize
> > > > intra-cluster communications when moving data is unavoidable.
> > > >
> > > > -Will
> > > >
> > > > On Thu, Oct 9, 2014 at 7:57 AM, Albert Vila <albert.v...@augure.com>
> > > wrote:
> > > >
> > > >> Hi
> > > >>
> > > >> I just came across Kafka when I was trying to find solutions to
> > > >> scale our current architecture.
> > > >>
> > > >> We are currently downloading and processing 6M documents per day
> > > >> from online and social media. We have a different workflow for
> > > >> each type of document, but some of the steps are keyword
> > > >> extraction, language detection, clustering, classification,
> > > >> indexation, .... We are using Gearman to dispatch the jobs to
> > > >> workers, and we have some queues on a database.
> > > >>
> > > >> I'm wondering if we could integrate Kafka into the current
> > > >> workflow and whether it's feasible. One of our main discussions is
> > > >> whether to go to a fully distributed architecture or a
> > > >> semi-distributed one. I mean, distribute everything or process
> > > >> some steps on the same machine (crawling, keyword extraction,
> > > >> language detection, indexation). We don't know which one scales
> > > >> better; each has pros and cons.
> > > >>
> > > >> Now we have a semi-distributed one, as we had network problems
> > > >> given the amount of data we were moving around. So now, all
> > > >> documents crawled on server X are later dispatched through Gearman
> > > >> to the same server. What we dispatch through Gearman is only the
> > > >> document id; the document data remains on the crawling server in
> > > >> Memcached, so the network traffic is kept to a minimum.
> > > >>
> > > >> What do you think?
> > > >> Is it feasible to remove all the database queues and Gearman and
> > > >> move to Kafka? As Kafka is mainly message-based, I think we would
> > > >> have to move the messages themselves around; should we take the
> > > >> network into account? Might we face the same problems?
> > > >> If so, is there a way to pin some steps to the same machine, to
> > > >> avoid network traffic?
> > > >>
> > > >> Any help or comment will be appreciated. And if someone has had a
> > > >> similar problem and knows this kind of architecture, their input
> > > >> will be more than welcome.
> > > >>
> > > >> Thanks
> > > >>
> > > >
> > >
> > >
> >
> >



-- 
*Albert Vila*
R&D Manager & Software Developer


Tél. : +34 972 982 968

*www.augure.com* <http://www.augure.com/> | *Blog.* Reputation in action
<http://blog.augure.es/> | *Twitter. *@AugureSpain
<https://twitter.com/AugureSpain>
*Skype *: albert.vila | *Access map.* Augure Girona
<https://maps.google.com/maps?q=Eiximenis+12,+17001+Girona,+Espanya&hl=ca&sll=50.956548,6.799948&sspn=30.199963,86.044922&hnear=Carrer+Eiximenis,+12,+17001+Girona,+Espanya&t=m&z=16>
