Chris,

Many thanks for the swift response. The data set is huge and the frequency
is bursty.
What do you suggest?

- Shekar


On Thu, Aug 21, 2014 at 11:05 AM, Chris Riccomini <
[email protected]> wrote:

> Hey Shekar,
>
> This is feasible, and you are on the right track.
>
> For the sake of discussion, I'm going to pretend that you have a Kafka
> topic called "PageViewEvent", which has just the IP address that was used
> to view a page. These messages will be logged every time a page view
> happens. I'm also going to pretend that you have some state called "IPGeo"
> (e.g. the MaxMind data set). In this example, we'll want to join the
> long/lat geo information from IPGeo to the PageViewEvent, and send it to a
> new topic: PageViewEventsWithGeo.
>
> You have several options on how to implement this example.
>
> 1. If your joining data set (IPGeo) is relatively small and changes
> infrequently, you can just pack it up in your jar or .tgz file, and open
> it in every StreamTask.
> 2. If your data set is small, but changes somewhat frequently, you can
> throw the data set on some HTTP/HDFS/S3 server somewhere, and have your
> StreamTask refresh it periodically by re-downloading it.
> 3. You can make RPC calls for the IPGeo data on every page view event
> by querying some remote service or DB (e.g. Cassandra).
> 4. You can use Samza's state feature to send your IPGeo data as a series of
> messages to a log-compacted Kafka topic
> (https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction), and
> configure your Samza job to read this topic as a bootstrap stream
> (http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html).
>
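The join that all four options perform can be sketched in plain Java. This is only an illustration under stated assumptions: the class `GeoJoinSketch`, its method names, and the comma-separated output format are hypothetical, and an in-memory `HashMap` stands in for wherever the IPGeo data actually lives (a packaged file, a remote service, or Samza's local state). No Samza or Kafka APIs are used.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class GeoJoinSketch {
    // Hypothetical stand-in for the IPGeo state: IP address -> {latitude, longitude}.
    private final Map<String, double[]> ipGeo = new HashMap<>();

    // Would be called for each message read from the IPGeo data source.
    public void onGeoUpdate(String ip, double lat, double lon) {
        ipGeo.put(ip, new double[] {lat, lon});
    }

    // Would be called for each PageViewEvent. Returns the enriched record
    // ("ip,lat,lon") or empty when no geo entry is known for that IP.
    public Optional<String> onPageView(String ip) {
        double[] geo = ipGeo.get(ip);
        if (geo == null) {
            return Optional.empty();
        }
        return Optional.of(ip + "," + geo[0] + "," + geo[1]);
    }

    public static void main(String[] args) {
        GeoJoinSketch join = new GeoJoinSketch();
        join.onGeoUpdate("203.0.113.7", 37.5, -122.25);
        // A known IP yields an enriched record; an unknown IP yields empty.
        System.out.println(join.onPageView("203.0.113.7"));
        System.out.println(join.onPageView("198.51.100.1"));
    }
}
```

In a real job the enriched record would be sent to PageViewEventsWithGeo rather than returned, and the handling of unknown IPs (drop, pass through, or retry) is a design choice the sketch leaves open.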
> For (4), you'd have to partition the IPGeo state topic according to the
> same key as PageViewEvent. If PageViewEvent were partitioned by, say,
> member ID, but you want your IPGeo state topic to be partitioned by IP
> address, then you'd have to have an upstream job that re-partitioned
> PageViewEvent into some new topic by IP address. This new topic will have
> to have the same number of partitions as the IPGeo state topic (if IPGeo
> has 8 partitions, then the new PageViewEventRepartitioned topic needs 8 as
> well). This will cause your PageViewEventRepartitioned topic and your
> IPGeo state topic to be aligned such that the StreamTask that gets page
> views for IP address X will also have the IPGeo information for IP address
> X.
>
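The alignment requirement above can be illustrated with a small Java sketch. It assumes both topics are keyed by IP address and use the same partitioning function; `partitionFor` is a hypothetical stand-in (a byte-array hash taken modulo the partition count), not the hash that Kafka's producer actually uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CoPartitioning {
    // Hypothetical partitioner: maps a key to an index in [0, numPartitions).
    // floorMod keeps the result non-negative even when the hash is negative.
    public static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        String ip = "203.0.113.7";
        int ipGeoPartitions = 8;        // partitions in the IPGeo state topic
        int repartitionedPartitions = 8; // must match for the join to line up

        int geoPartition = partitionFor(ip, ipGeoPartitions);
        int pageViewPartition = partitionFor(ip, repartitionedPartitions);

        // Same key, same function, same partition count: same partition index,
        // so one StreamTask sees both the page view and its geo entry.
        System.out.println("aligned: " + (geoPartition == pageViewPartition));
    }
}
```

If the two partition counts differed, the same IP could map to different indices in the two topics, and the task receiving the page view would not hold the matching IPGeo entry.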
> Which strategy you pick is really up to you. :) (4) is the most
> complicated, but also the most flexible, and most operationally sound. (1)
> is the easiest if it fits your needs.
>
> Cheers,
> Chris
>
> On 8/21/14 10:15 AM, "Shekar Tippur" <[email protected]> wrote:
>
> >Hello,
> >
> >I am new to Samza. I have just installed Hello Samza and got it working.
> >
> >Here is the use case for which I am trying to use Samza:
> >
> >
> >1. Cache the contextual information which contains more information about
> >the hostname or IP address using Samza/Yarn/Kafka
> >2. Collect alert and metric events which contain either hostname or IP
> >address
> >3. Append contextual information to the alert and metric and insert to a
> >Kafka queue from which other subscribers read.
> >
> >Can you please shed some light on
> >
> >1. Is this feasible?
> >2. Am I on the right thought process
> >3. How do I start
> >
> >I now have 1 & 2 working separately. I need to integrate them.
> >
> >Appreciate any input.
> >
> >- Shekar
>
>
