Chris,

This is a perfectly good answer. I will start poking more into option #4.
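To check my own understanding of (4), here is a rough sketch of the join task I am picturing, based on your description and the state-management doc. The class, store, and topic names are just my guesses, and I am pretending keys and messages are plain strings to keep it short:

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class PageViewGeoJoinTask implements StreamTask, InitableTask {
  // Output topic from your example; "kafka" is the system name in my config.
  private static final SystemStream OUTPUT = new SystemStream("kafka", "PageViewEventsWithGeo");

  private KeyValueStore<String, String> ipGeoStore;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // "ip-geo" is whatever store name I end up configuring (stores.ip-geo.*).
    ipGeoStore = (KeyValueStore<String, String>) context.getStore("ip-geo");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getStream();
    String ip = (String) envelope.getKey();

    if ("IPGeo".equals(stream)) {
      // State topic: keep the local key-value store current, keyed by IP address.
      ipGeoStore.put(ip, (String) envelope.getMessage());
    } else {
      // PageViewEventRepartitioned: local lookup, then emit the enriched event.
      String geo = ipGeoStore.get(ip);
      String pageView = (String) envelope.getMessage();
      collector.send(new OutgoingMessageEnvelope(OUTPUT, ip, pageView + "," + geo));
    }
  }
}

Does that look roughly right? My understanding is that IPGeo gets flagged as a bootstrap stream (something like systems.kafka.streams.IPGeo.samza.bootstrap=true, if I am reading the streams doc correctly), so the whole table is read into the store before any page views are processed. I have also put sketches of the re-partitioning job and of the simpler option (1) below the quoted thread, in case I am misreading anything.
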
- Shekar

On Thu, Aug 21, 2014 at 1:05 PM, Chris Riccomini <[email protected]> wrote:

> Hey Shekar,
>
> Your two options are really (3) or (4), then. You can either run some
> external DB that holds the data set and query it from a StreamTask, or
> you can use Samza's state store feature to push the data into a stream
> that you can then store in a partitioned key-value store along with your
> StreamTasks. There is some documentation here about the state store
> approach:
>
> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html
>
> (4) is going to require more up-front effort from you, since you'll have
> to understand how Kafka's partitioning model works and set up some
> pipeline to push the updates for your state. In the long run, I believe
> it's the better approach, though. Local lookups on a key-value store
> should be faster than doing remote RPC calls to a DB for every message.
>
> I'm sorry I can't give you a more definitive answer. It's really about
> trade-offs.
>
> Cheers,
> Chris
>
> On 8/21/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote:
>
> >Chris,
> >
> >A big thanks for a swift response. The data set is huge and the
> >frequency is in bursts. What do you suggest?
> >
> >- Shekar
> >
> >On Thu, Aug 21, 2014 at 11:05 AM, Chris Riccomini
> ><[email protected]> wrote:
> >
> >> Hey Shekar,
> >>
> >> This is feasible, and you are on the right thought process.
> >>
> >> For the sake of discussion, I'm going to pretend that you have a Kafka
> >> topic called "PageViewEvent", which has just the IP address that was
> >> used to view a page. These messages will be logged every time a page
> >> view happens. I'm also going to pretend that you have some state called
> >> "IPGeo" (e.g. the MaxMind data set). In this example, we'll want to
> >> join the long/lat geo information from IPGeo to the PageViewEvent, and
> >> send it to a new topic: PageViewEventsWithGeo.
> >>
> >> You have several options on how to implement this example.
> >>
> >> 1. If your joining data set (IPGeo) is relatively small and changes
> >> infrequently, you can just pack it up in your jar or .tgz file, and
> >> open it in every StreamTask.
> >> 2. If your data set is small, but changes somewhat frequently, you can
> >> throw the data set on some HTTP/HDFS/S3 server somewhere, and have your
> >> StreamTask refresh it periodically by re-downloading it.
> >> 3. You can do remote RPC calls for the IPGeo data on every page view
> >> event by querying some remote service or DB (e.g. Cassandra).
> >> 4. You can use Samza's state feature to send your IPGeo data as a
> >> series of messages to a log-compacted Kafka topic
> >> (https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction), and
> >> configure your Samza job to read this topic as a bootstrap stream
> >> (http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html).
> >>
> >> For (4), you'd have to partition the IPGeo state topic according to the
> >> same key as PageViewEvent. If PageViewEvent were partitioned by, say,
> >> member ID, but you want your IPGeo state topic to be partitioned by IP
> >> address, then you'd have to have an upstream job that re-partitioned
> >> PageViewEvent into some new topic by IP address. This new topic will
> >> have to have the same number of partitions as the IPGeo state topic (if
> >> IPGeo has 8 partitions, then the new PageViewEventRepartitioned topic
> >> needs 8 as well).
> >> This will cause your PageViewEventRepartitioned topic and your IPGeo
> >> state topic to be aligned such that the StreamTask that gets page views
> >> for IP address X will also have the IPGeo information for IP address X.
> >>
> >> Which strategy you pick is really up to you. :) (4) is the most
> >> complicated, but also the most flexible and the most operationally
> >> sound. (1) is the easiest if it fits your needs.
> >>
> >> Cheers,
> >> Chris
> >>
> >> On 8/21/14 10:15 AM, "Shekar Tippur" <[email protected]> wrote:
> >>
> >> >Hello,
> >> >
> >> >I am new to Samza. I have just installed Hello Samza and got it
> >> >working.
> >> >
> >> >Here is the use case for which I am trying to use Samza:
> >> >
> >> >1. Cache the contextual information which contains more information
> >> >about the hostname or IP address, using Samza/YARN/Kafka.
> >> >2. Collect alert and metric events which contain either the hostname
> >> >or the IP address.
> >> >3. Append the contextual information to the alert and metric events,
> >> >and insert them into a Kafka queue from which other subscribers read.
> >> >
> >> >Can you please shed some light on:
> >> >
> >> >1. Is this feasible?
> >> >2. Am I on the right thought process?
> >> >3. How do I start?
> >> >
> >> >I now have 1 & 2 of them working disparately. I need to integrate
> >> >them.
> >> >
> >> >Appreciate any input.
> >> >
> >> >- Shekar
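
For the re-partitioning step you describe for (4), I am assuming an upstream job as simple as the one below would do. extractIp() is a placeholder for however the IP address is actually encoded in the page view (or, in our case, alert/metric) payload:

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class PageViewRepartitionTask implements StreamTask {
  private static final SystemStream OUTPUT =
      new SystemStream("kafka", "PageViewEventRepartitioned");

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) {
    // Re-key every page view by IP address so this topic is co-partitioned with
    // the IPGeo state topic (both need the same partition count).
    String pageView = (String) envelope.getMessage();
    collector.send(new OutgoingMessageEnvelope(OUTPUT, extractIp(pageView), pageView));
  }

  // Placeholder: pull the IP address out of the payload, however it is encoded.
  private String extractIp(String pageView) {
    return pageView;
  }
}

The requirement I took away is that PageViewEventRepartitioned and the IPGeo state topic are keyed the same way and have the same number of partitions, so the matching entries always land in the same StreamTask.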

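And in case (4) turns out to be more than we need, my fallback for option (1) would be to pack the lookup file into the job's .tgz and load it in init(), roughly like this (ip-geo.csv and its one-entry-per-line format are made up):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.samza.config.Config;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class PackagedIpGeoJoinTask implements StreamTask, InitableTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "PageViewEventsWithGeo");

  private final Map<String, String> ipGeo = new HashMap<String, String>();

  @Override
  public void init(Config config, TaskContext context) throws Exception {
    // ip-geo.csv is bundled in the job package: one "ip,lat,long" line per address.
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(getClass().getResourceAsStream("/ip-geo.csv")));
    String line;
    while ((line = reader.readLine()) != null) {
      int comma = line.indexOf(',');
      ipGeo.put(line.substring(0, comma), line.substring(comma + 1));
    }
    reader.close();
  }

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) {
    // Plain in-memory lookup; fine while the data set stays small and static.
    String ip = (String) envelope.getKey();
    collector.send(new OutgoingMessageEnvelope(OUTPUT, ip, envelope.getMessage() + "," + ipGeo.get(ip)));
  }
}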