Chris,

This is a perfectly good answer. I will start poking more into option #4.

- Shekar


On Thu, Aug 21, 2014 at 1:05 PM, Chris Riccomini <[email protected]> wrote:

> Hey Shekar,
>
> Your two options are really (3) or (4), then. You can either run an
> external DB that holds the data set and query it from a StreamTask, or you
> can use Samza's state store feature to push the data into a stream that you
> then store in a partitioned key-value store alongside your StreamTasks.
> There is some documentation on the state store approach here:
>
> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html
>
>
> (4) is going to require more up-front effort from you, since you'll have
> to understand how Kafka's partitioning model works and set up a pipeline
> to push the updates for your state. In the long run, though, I believe
> it's the better approach. Local lookups on a key-value store should be
> faster than doing remote RPC calls to a DB for every message.
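>
> To make (4) a bit more concrete, here is a rough sketch of what the join
> task could look like. The store name, topic names, and string serdes below
> are just placeholders, and the "ip-geo" store and the IPGeo bootstrap
> stream would have to be declared in the job config:
>
> import org.apache.samza.config.Config;
> import org.apache.samza.storage.kv.KeyValueStore;
> import org.apache.samza.system.IncomingMessageEnvelope;
> import org.apache.samza.system.OutgoingMessageEnvelope;
> import org.apache.samza.system.SystemStream;
> import org.apache.samza.task.InitableTask;
> import org.apache.samza.task.MessageCollector;
> import org.apache.samza.task.StreamTask;
> import org.apache.samza.task.TaskContext;
> import org.apache.samza.task.TaskCoordinator;
>
> public class PageViewGeoJoinTask implements StreamTask, InitableTask {
>   private KeyValueStore<String, String> ipGeoStore;
>
>   public void init(Config config, TaskContext context) {
>     // "ip-geo" is the (assumed) store name declared in the job config.
>     ipGeoStore = (KeyValueStore<String, String>) context.getStore("ip-geo");
>   }
>
>   public void process(IncomingMessageEnvelope envelope,
>       MessageCollector collector, TaskCoordinator coordinator) {
>     String stream = envelope.getSystemStreamPartition().getStream();
>     if ("IPGeo".equals(stream)) {
>       // State update from the bootstrap stream: remember this IP's geo data.
>       ipGeoStore.put((String) envelope.getKey(), (String) envelope.getMessage());
>     } else {
>       // Page view (keyed by IP): local lookup instead of a remote RPC call.
>       String ip = (String) envelope.getKey();
>       String geo = ipGeoStore.get(ip);
>       collector.send(new OutgoingMessageEnvelope(
>           new SystemStream("kafka", "PageViewEventsWithGeo"), ip,
>           (String) envelope.getMessage() + "|" + geo));
>     }
>   }
> }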
>
> I'm sorry I can't give you a more definitive answer. It's really about
> trade-offs.
>
> Cheers,
> Chris
>
> On 8/21/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote:
>
> >Chris,
> >
> >A big thanks for the swift response. The data set is huge and the traffic
> >arrives in bursts.
> >What do you suggest?
> >
> >- Shekar
> >
> >
> >On Thu, Aug 21, 2014 at 11:05 AM, Chris Riccomini <[email protected]> wrote:
> >
> >> Hey Shekar,
> >>
> >> This is feasible, and you are on the right track.
> >>
> >> For the sake of discussion, I'm going to pretend that you have a Kafka
> >> topic called "PageViewEvent", which has just the IP address that was used
> >> to view a page. These messages will be logged every time a page view
> >> happens. I'm also going to pretend that you have some state called "IPGeo"
> >> (e.g. the MaxMind data set). In this example, we'll want to join the
> >> long/lat geo information from IPGeo to the PageViewEvent, and send it to a
> >> new topic: PageViewEventsWithGeo.
> >>
> >> You have several options on how to implement this example.
> >>
> >> 1. If your joining data set (IPGeo) is relatively small and changes
> >> infrequently, you can just pack it up in your jar or .tgz file, and open
> >> it in every StreamTask (see the rough sketch after this list).
> >> 2. If your data set is small, but changes somewhat frequently, you can
> >> throw the data set on some HTTP/HDFS/S3 server somewhere, and have your
> >> StreamTask refresh it periodically by re-downloading it.
> >> 3. You can do remote RPC calls for the IPGeo data on every page view
> >> event by querying some remote service or DB (e.g. Cassandra).
> >> 4. You can use Samza's state feature to send your IPGeo data as a series
> >> of messages to a log-compacted Kafka topic
> >> (https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction), and
> >> configure your Samza job to read this topic as a bootstrap stream
> >> (http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html).
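> >>
> >> For (1), a rough sketch of opening a packed IPGeo file in each StreamTask
> >> (the "/ip-geo.csv" resource name and the line format are made up; the file
> >> just has to ship inside your jar or .tgz):
> >>
> >> import java.io.BufferedReader;
> >> import java.io.InputStreamReader;
> >> import java.util.HashMap;
> >> import java.util.Map;
> >> import org.apache.samza.config.Config;
> >> import org.apache.samza.system.IncomingMessageEnvelope;
> >> import org.apache.samza.system.OutgoingMessageEnvelope;
> >> import org.apache.samza.system.SystemStream;
> >> import org.apache.samza.task.InitableTask;
> >> import org.apache.samza.task.MessageCollector;
> >> import org.apache.samza.task.StreamTask;
> >> import org.apache.samza.task.TaskContext;
> >> import org.apache.samza.task.TaskCoordinator;
> >>
> >> public class PackedIpGeoJoinTask implements StreamTask, InitableTask {
> >>   private final Map<String, String> ipGeo = new HashMap<String, String>();
> >>
> >>   public void init(Config config, TaskContext context) throws Exception {
> >>     // Load the packed IPGeo data once per task from the job package.
> >>     BufferedReader in = new BufferedReader(new InputStreamReader(
> >>         getClass().getResourceAsStream("/ip-geo.csv")));
> >>     String line;
> >>     while ((line = in.readLine()) != null) {
> >>       String[] parts = line.split(",", 2); // ip,geo
> >>       ipGeo.put(parts[0], parts[1]);
> >>     }
> >>     in.close();
> >>   }
> >>
> >>   public void process(IncomingMessageEnvelope envelope,
> >>       MessageCollector collector, TaskCoordinator coordinator) {
> >>     // Join the page view (keyed by IP address) against the in-memory map.
> >>     String ip = (String) envelope.getKey();
> >>     collector.send(new OutgoingMessageEnvelope(
> >>         new SystemStream("kafka", "PageViewEventsWithGeo"), ip,
> >>         (String) envelope.getMessage() + "|" + ipGeo.get(ip)));
> >>   }
> >> }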
> >>
> >> For (4), you'd have to partition the IPGeo state topic according to the
> >> same key as PageViewEvent. If PageViewEvent were partitioned by, say,
> >> member ID, but you want your IPGeo state topic to be partitioned by IP
> >> address, then you'd have to have an upstream job that re-partitioned
> >> PageViewEvent into some new topic by IP address. This new topic will have
> >> to have the same number of partitions as the IPGeo state topic (if IPGeo
> >> has 8 partitions, then the new PageViewEventRepartitioned topic needs 8 as
> >> well). This will cause your PageViewEventRepartitioned topic and your
> >> IPGeo state topic to be aligned such that the StreamTask that gets page
> >> views for IP address X will also have the IPGeo information for IP
> >> address X.
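> >>
> >> That upstream re-partitioning job can be a very small task. A rough sketch
> >> (system and topic names are assumptions, and PageViewEventRepartitioned
> >> would be created with the same partition count as the IPGeo topic):
> >>
> >> import org.apache.samza.system.IncomingMessageEnvelope;
> >> import org.apache.samza.system.OutgoingMessageEnvelope;
> >> import org.apache.samza.system.SystemStream;
> >> import org.apache.samza.task.MessageCollector;
> >> import org.apache.samza.task.StreamTask;
> >> import org.apache.samza.task.TaskCoordinator;
> >>
> >> public class RepartitionByIpTask implements StreamTask {
> >>   private static final SystemStream OUTPUT =
> >>       new SystemStream("kafka", "PageViewEventRepartitioned");
> >>
> >>   public void process(IncomingMessageEnvelope envelope,
> >>       MessageCollector collector, TaskCoordinator coordinator) {
> >>     // Assume a string page view message whose first comma-separated field
> >>     // is the IP address; extracting it depends on your actual format.
> >>     String pageView = (String) envelope.getMessage();
> >>     String ipAddress = pageView.split(",", 2)[0];
> >>
> >>     // Keying by IP address lines the new topic up, partition for partition,
> >>     // with an IPGeo topic that is also keyed by IP address.
> >>     collector.send(new OutgoingMessageEnvelope(OUTPUT, ipAddress, pageView));
> >>   }
> >> }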
> >>
> >> Which strategy you pick is really up to you. :) (4) is the most
> >> complicated, but also the most flexible, and most operationally sound.
> >> (1) is the easiest if it fits your needs.
> >>
> >> Cheers,
> >> Chris
> >>
> >> On 8/21/14 10:15 AM, "Shekar Tippur" <[email protected]> wrote:
> >>
> >> >Hello,
> >> >
> >> >I am new to Samza. I have just installed Hello Samza and got it working.
> >> >
> >> >Here is the use case for which I am trying to use Samza:
> >> >
> >> >
> >> >1. Cache the contextual information, which contains more information
> >> >about the hostname or IP address, using Samza/YARN/Kafka.
> >> >2. Collect alert and metric events, which contain either a hostname or an
> >> >IP address.
> >> >3. Append the contextual information to the alert or metric and insert it
> >> >into a Kafka topic that other subscribers read from.
> >> >
> >> >Can you please shed some light on the following:
> >> >
> >> >1. Is this feasible?
> >> >2. Am I on the right track?
> >> >3. How do I get started?
> >> >
> >> >I now have 1 and 2 working separately; I need to integrate them.
> >> >
> >> >Appreciate any input.
> >> >
> >> >- Shekar
> >>
> >>
>
>
