Hey Shekar,

This is feasible, and you are on the right thought process.

For the sake of discussion, I'm going to pretend that you have a Kafka
topic called "PageViewEvent", which has just the IP address that was used
to view a page. These messages will be logged every time a page view
happens. I'm also going to pretend that you have some state called "IPGeo"
(e.g. The maxmind data set). In this example, we'll want to join the
long/lat geo information from IPGeo to the PageViewEvent, and send it to a
new topic: PageViewEventsWithGeo.

You have several options on how to implement this example.

1. If your joining data set (IPGeo) is relatively small and changes
infrequently, you can just pack it up in your jar or .tgz file, and open
it open in every StreamTask.
2. If your data set is small, but changes somewhat frequently, you can
throw the data set on some HTTP/HDFS/S3 server somewhere, and have your
StreamTask refresh it periodically by re-downloading it.
3. You can do remote RPC calls for the IPGeo data on every page view event
by query some remote service or DB (e.g. Cassandra).
4. You can use Samza's state feature to set your IPGeo data as a series of
messages to a log-compacted Kafka topic
(https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction), and
configure your Samza job to read this topic as a bootstrap stream
(http://samza.incubator.apache.org/learn/documentation/0.7.0/container/stre
ams.html). 

For (4), you'd have to partition the IPGeo state topic according to the
same key as PageViewEvent. If PageViewEvent were partitioned by, say,
member ID, but you want your IPGeo state topic to be partitioned by IP
address, then you'd have to have an upstream job that re-partitioned
PageViewEvent into some new topic by IP address. This new topic will have
to have the same number of partitions as the IPGeo state topic (if IPGeo
has 8 partitions, then the new PageViewEventRepartitioned topic needs 8 as
well). This will cause your PageViewEventRepartitioned topic and your
IPGeo state topic to be aligned such that the StreamTask that gets page
views for IP address X will also have the IPGeo information for IP address
X.

Which strategy you pick is really up to you. :) (4) is the most
complicated, but also the most flexible, and most operationally sound. (1)
is the easiest if it fits your needs.

Cheers,
Chris

On 8/21/14 10:15 AM, "Shekar Tippur" <[email protected]> wrote:

>Hello,
>
>I am new to Samza. I have just installed Hello Samza and got it working.
>
>Here is the use case for which I am trying to use Samza:
>
>
>1. Cache the contextual information which contains more information about
>the hostname or IP address using Samza/Yarn/Kafka
>2. Collect alert and metric events which contain either hostname or IP
>address
>3. Append contextual information to the alert and metric and insert to a
>Kafka queue from which other subscribers read off of.
>
>Can you please shed some light on
>
>1. Is this feasible?
>2. Am I on the right thought process
>3. How do I start
>
>I now have 1 & 2 of them working disparately. I need to integrate them.
>
>Appreciate any input.
>
>- Shekar

Reply via email to