Hey Shekar,

Those SLF4J warnings mean that you don't have an SLF4J binding on your
classpath. Try adding slf4j-log4j12 as a runtime dependency of your project.
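For example, in your pom.xml (the version here is an assumption; keep it in sync with whatever slf4j-api version is already on your classpath):

```xml
<!-- SLF4J binding for log4j 1.2; version is an assumption,
     keep it in sync with your slf4j-api version. -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-log4j12</artifactId>
  <version>1.7.7</version>
  <scope>runtime</scope>
</dependency>
```

With a binding on the runtime classpath, the StaticLoggerBinder warning goes away and log output flows to log4j.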

Cheers,
Chris

On 9/2/14 3:24 PM, "Shekar Tippur" <[email protected]> wrote:

>Chris,
>
>In the current state, I just want Samza to connect to 127.0.0.1.
>
>I have set YARN_HOME to
>
>$ echo $YARN_HOME
>
>/home/ctippur/hello-samza/deploy/yarn
>
>I still don't see anything on the Hadoop console.
>
> Also, I see this during startup
>
>
>SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>
>SLF4J: Defaulting to no-operation (NOP) logger implementation
>
>SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
>details.
>
>SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>
>SLF4J: Defaulting to no-operation (NOP) logger implementation
>
>SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
>details.
>
>- Shekar
>
>
>On Tue, Sep 2, 2014 at 2:20 PM, Chris Riccomini <
>[email protected]> wrote:
>
>> Hey Shekar,
>>
>> Ah ha. In that case, do you expect your SamzaContainer to try to connect
>> to the RM at 127.0.0.1, or do you expect it to try to connect to some
>> remote RM? If you expect it to try and connect to a remote RM, it's not
>> doing that. Perhaps because YARN_HOME isn't set.
>>
>> If you go to your RM's web interface, how many active nodes do you see
>> listed?
>>
>> Cheers,
>> Chris
>>
>> On 9/2/14 2:17 PM, "Shekar Tippur" <[email protected]> wrote:
>>
>> >Chris ..
>> >
>> >I am using a RHEL server to host all the components (Yarn, Kafka, Samza).
>> >I don't have ACLs open to Wikipedia.
>> >I am following ..
>> >
>> >http://samza.incubator.apache.org/learn/tutorials/latest/run-hello-samza-without-internet.html
>> >
>> >- Shekar
>> >
>> >
>> >
>> >On Tue, Sep 2, 2014 at 2:13 PM, Chris Riccomini <
>> >[email protected]> wrote:
>> >
>> >> Hey Shekar,
>> >>
>> >> Can you try changing that to:
>> >>
>> >>   http://127.0.0.1:8088/cluster
>> >>
>> >>
>> >> And see if you can connect?
>> >>
>> >> Cheers,
>> >> Chris
>> >>
>> >> On 9/2/14 1:21 PM, "Shekar Tippur" <[email protected]> wrote:
>> >>
>> >> >Other observation is ..
>> >> >
>> >> >http://10.132.62.185:8088/cluster shows that no applications are running.
>> >> >
>> >> >- Shekar
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >On Tue, Sep 2, 2014 at 1:15 PM, Shekar Tippur <[email protected]>
>> >>wrote:
>> >> >
>> >> >> Yarn seem to be running ..
>> >> >>
>> >> >> yarn      5462  0.0  2.0 1641296 161404 ?      Sl   Jun20  95:26
>> >> >> /usr/java/jdk1.6.0_31/bin/java -Dproc_resourcemanager -Xmx1000m
>> >> >> -Dhadoop.log.dir=/var/log/hadoop-yarn
>> >> >>-Dyarn.log.dir=/var/log/hadoop-yarn
>> >> >> -Dhadoop.log.file=yarn-yarn-resourcemanager-pppdc9prd2y2.log
>> >> >> -Dyarn.log.file=yarn-yarn-resourcemanager-pppdc9prd2y2.log
>> >> >> -Dyarn.home.dir=/usr/lib/hadoop-yarn
>> >> >>-Dhadoop.home.dir=/usr/lib/hadoop-yarn
>> >> >> -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA
>> >> >> -Djava.library.path=/usr/lib/hadoop/lib/native -classpath
>> >> >> /etc/hadoop/conf:/etc/hadoop/conf:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-yarn/lib/*:/etc/hadoop/conf/rm-config/log4j.properties
>> >> >> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
>> >> >>
>> >> >> I can tail kafka topic as well ..
>> >> >>
>> >> >> deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-raw
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Tue, Sep 2, 2014 at 1:04 PM, Chris Riccomini <
>> >> >> [email protected]> wrote:
>> >> >>
>> >> >>> Hey Shekar,
>> >> >>>
>> >> >>> It looks like your job is hanging trying to connect to the RM on your
>> >> >>> localhost. I thought that you said that your job was running in local
>> >> >>> mode. If so, it should be using the LocalJobFactory. If not, and you
>> >> >>> intend to run on YARN, is your YARN RM up and running on localhost?
>> >> >>>
>> >> >>> Cheers,
>> >> >>> Chris
>> >> >>>
>> >> >>> On 9/2/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote:
>> >> >>>
>> >> >>> >Chris ..
>> >> >>> >
>> >> >>> >$ cat ./deploy/samza/undefined-samza-container-name.log
>> >> >>> >
>> >> >>> >2014-09-02 11:17:58 JobRunner [INFO] job factory:
>> >> >>> >org.apache.samza.job.yarn.YarnJobFactory
>> >> >>> >
>> >> >>> >2014-09-02 11:17:59 ClientHelper [INFO] trying to connect to RM
>> >> >>> >127.0.0.1:8032
>> >> >>> >
>> >> >>> >2014-09-02 11:17:59 NativeCodeLoader [WARN] Unable to load
>> >> >>>native-hadoop
>> >> >>> >library for your platform... using builtin-java classes where
>> >> >>>applicable
>> >> >>> >
>> >> >>> >2014-09-02 11:17:59 RMProxy [INFO] Connecting to ResourceManager
>> >>at /
>> >> >>> >127.0.0.1:8032
>> >> >>> >
>> >> >>> >
>> >> >>> >and Log4j ..
>> >> >>> >
>> >> >>> ><!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
>> >> >>> >
>> >> >>> ><log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
>> >> >>> >
>> >> >>> >  <appender name="RollingAppender" class="org.apache.log4j.DailyRollingFileAppender">
>> >> >>> >
>> >> >>> >     <param name="File" value="${samza.log.dir}/${samza.container.name}.log" />
>> >> >>> >
>> >> >>> >     <param name="DatePattern" value="'.'yyyy-MM-dd" />
>> >> >>> >
>> >> >>> >     <layout class="org.apache.log4j.PatternLayout">
>> >> >>> >
>> >> >>> >      <param name="ConversionPattern" value="%d{yyyy-MM-dd HH:mm:ss} %c{1} [%p] %m%n" />
>> >> >>> >
>> >> >>> >     </layout>
>> >> >>> >
>> >> >>> >  </appender>
>> >> >>> >
>> >> >>> >  <root>
>> >> >>> >
>> >> >>> >    <priority value="info" />
>> >> >>> >
>> >> >>> >    <appender-ref ref="RollingAppender"/>
>> >> >>> >
>> >> >>> >  </root>
>> >> >>> >
>> >> >>> ></log4j:configuration>
>> >> >>> >
>> >> >>> >
>> >> >>> >On Tue, Sep 2, 2014 at 12:02 PM, Chris Riccomini <
>> >> >>> >[email protected]> wrote:
>> >> >>> >
>> >> >>> >> Hey Shekar,
>> >> >>> >>
>> >> >>> >> Can you attach your log files? I'm wondering if it's a
>> >> >>> >> mis-configured log4j.xml (or missing slf4j-log4j jar), which is
>> >> >>> >> leading to nearly empty log files. Also, I'm wondering if the job
>> >> >>> >> starts fully. Anything you can attach would be helpful.
>> >> >>> >>
>> >> >>> >> Cheers,
>> >> >>> >> Chris
>> >> >>> >>
>> >> >>> >> On 9/2/14 11:43 AM, "Shekar Tippur" <[email protected]> wrote:
>> >> >>> >>
>> >> >>> >> >I am running in local mode.
>> >> >>> >> >
>> >> >>> >> >S
>> >> >>> >> >On Sep 2, 2014 11:42 AM, "Yan Fang" <[email protected]>
>> >>wrote:
>> >> >>> >> >
>> >> >>> >> >> Hi Shekar.
>> >> >>> >> >>
>> >> >>> >> >> Are you running the job in local mode or yarn mode? If yarn
>> >> >>> >> >> mode, the log is in the yarn's container log.
>> >> >>> >> >>
>> >> >>> >> >> Thanks,
>> >> >>> >> >>
>> >> >>> >> >> Fang, Yan
>> >> >>> >> >> [email protected]
>> >> >>> >> >> +1 (206) 849-4108
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >> On Tue, Sep 2, 2014 at 11:36 AM, Shekar Tippur
>> >> >>><[email protected]>
>> >> >>> >> >>wrote:
>> >> >>> >> >>
>> >> >>> >> >> > Chris,
>> >> >>> >> >> >
>> >> >>> >> >> > Got some time to play around a bit more.
>> >> >>> >> >> > I tried to edit
>> >> >>> >> >> >
>> >> >>> >> >> >
>> >> >>> >> >> > samza-wikipedia/src/main/java/samza/examples/wikipedia/task/WikipediaFeedStreamTask.java
>> >> >>> >> >> > to add a logger info statement to tap the incoming message.
>> >> >>> >> >> > I don't see the messages being printed to the log file.
>> >> >>> >> >> >
>> >> >>> >> >> > Is this the right place to start?
>> >> >>> >> >> >
>> >> >>> >> >> > public class WikipediaFeedStreamTask implements StreamTask {
>> >> >>> >> >> >
>> >> >>> >> >> >   private static final SystemStream OUTPUT_STREAM = new SystemStream("kafka", "wikipedia-raw");
>> >> >>> >> >> >
>> >> >>> >> >> >   private static final Logger log = LoggerFactory.getLogger(WikipediaFeedStreamTask.class);
>> >> >>> >> >> >
>> >> >>> >> >> >   @Override
>> >> >>> >> >> >   public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
>> >> >>> >> >> >     Map<String, Object> outgoingMap = WikipediaFeedEvent.toMap((WikipediaFeedEvent) envelope.getMessage());
>> >> >>> >> >> >     log.info(envelope.getMessage().toString());
>> >> >>> >> >> >     collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, outgoingMap));
>> >> >>> >> >> >   }
>> >> >>> >> >> > }
>> >> >>> >> >> >
>> >> >>> >> >> >
>> >> >>> >> >> > On Mon, Aug 25, 2014 at 9:01 AM, Chris Riccomini <
>> >> >>> >> >> > [email protected]> wrote:
>> >> >>> >> >> >
>> >> >>> >> >> > > Hey Shekar,
>> >> >>> >> >> > >
>> >> >>> >> >> > > Your thought process is on the right track. It's probably best to
>> >> >>> >> >> > > start with hello-samza, and modify it to get what you want. To
>> >> >>> >> >> > > start with, you'll want to:
>> >> >>> >> >> > >
>> >> >>> >> >> > > 1. Write a simple StreamTask that just does something silly like
>> >> >>> >> >> > > just print messages that it receives.
>> >> >>> >> >> > > 2. Write a configuration for the job that consumes from just the
>> >> >>> >> >> > > stream (alerts from different sources).
>> >> >>> >> >> > > 3. Run this to make sure you've got it working.
>> >> >>> >> >> > > 4. Now add your table join. This can be either a change-data
>> >> >>> >> >> > > capture (CDC) stream, or via a remote DB call.
>> >> >>> >> >> > >
>> >> >>> >> >> > > That should get you to a point where you've got your job up and
>> >> >>> >> >> > > running. From there, you could create your own Maven project, and
>> >> >>> >> >> > > migrate your code over accordingly.
>> >> >>> >> >> > >
>> >> >>> >> >> > > Cheers,
>> >> >>> >> >> > > Chris
>> >> >>> >> >> > >
>> >> >>> >> >> > > On 8/24/14 1:42 AM, "Shekar Tippur" <[email protected]>
>> >> >>>wrote:
>> >> >>> >> >> > >
>> >> >>> >> >> > > >Chris,
>> >> >>> >> >> > > >
>> >> >>> >> >> > > >I have gone through the documentation and decided that the
>> >> >>> >> >> > > >option that is most suitable for me is stream-table.
>> >> >>> >> >> > > >
>> >> >>> >> >> > > >I see the following things:
>> >> >>> >> >> > > >
>> >> >>> >> >> > > >1. Point samza to a table (database)
>> >> >>> >> >> > > >2. Point Samza to a stream - Alert stream from
>>different
>> >> >>> sources
>> >> >>> >> >> > > >3. Join key like a hostname
>> >> >>> >> >> > > >
>> >> >>> >> >> > > >I have Hello Samza working. To extend that to do what my needs
>> >> >>> >> >> > > >are, I am not sure where to start (needs more code change OR
>> >> >>> >> >> > > >configuration changes OR both)?
>> >> >>> >> >> > > >
>> >> >>> >> >> > > >I have gone through
>> >> >>> >> >> > > >
>> >> >>> >> >> > > >http://samza.incubator.apache.org/learn/documentation/latest/api/overview.html
>> >> >>> >> >> > > >
>> >> >>> >> >> > > >Is my thought process on the right track? Can you
>>please
>> >> >>>point
>> >> >>> >>me
>> >> >>> >> >>to
>> >> >>> >> >> the
>> >> >>> >> >> > > >right direction?
>> >> >>> >> >> > > >
>> >> >>> >> >> > > >- Shekar
>> >> >>> >> >> > > >
>> >> >>> >> >> > > >
>> >> >>> >> >> > > >
>> >> >>> >> >> > > >On Thu, Aug 21, 2014 at 1:08 PM, Shekar Tippur
>> >> >>> >><[email protected]>
>> >> >>> >> >> > wrote:
>> >> >>> >> >> > > >
>> >> >>> >> >> > > >> Chris,
>> >> >>> >> >> > > >>
>> >> >>> >> >> > > >> This is perfectly good answer. I will start poking
>>more
>> >> >>>into
>> >> >>> >> >>option
>> >> >>> >> >> > #4.
>> >> >>> >> >> > > >>
>> >> >>> >> >> > > >> - Shekar
>> >> >>> >> >> > > >>
>> >> >>> >> >> > > >>
>> >> >>> >> >> > > >> On Thu, Aug 21, 2014 at 1:05 PM, Chris Riccomini <
>> >> >>> >> >> > > >> [email protected]> wrote:
>> >> >>> >> >> > > >>
>> >> >>> >> >> > > >>> Hey Shekar,
>> >> >>> >> >> > > >>>
>> >> >>> >> >> > > >>> Your two options are really (3) or (4), then. You can either run
>> >> >>> >> >> > > >>> some external DB that holds the data set, and you can query it
>> >> >>> >> >> > > >>> from a StreamTask, or you can use Samza's state store feature to
>> >> >>> >> >> > > >>> push data into a stream that you can then store in a partitioned
>> >> >>> >> >> > > >>> key-value store along with your StreamTasks. There is some
>> >> >>> >> >> > > >>> documentation here about the state store approach:
>> >> >>> >> >> > > >>>
>> >> >>> >> >> > > >>> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html
>> >> >>> >> >> > > >>>
>> >> >>> >> >> > > >>>
>> >> >>> >> >> > > >>> (4) is going to require more up-front effort from you, since
>> >> >>> >> >> > > >>> you'll have to understand how Kafka's partitioning model works,
>> >> >>> >> >> > > >>> and set up some pipeline to push the updates for your state. In
>> >> >>> >> >> > > >>> the long run, I believe it's the better approach, though. Local
>> >> >>> >> >> > > >>> lookups on a key-value store should be faster than doing remote
>> >> >>> >> >> > > >>> RPC calls to a DB for every message.
>> >> >>> >> >> > > >>>
>> >> >>> >> >> > > >>> I'm sorry I can't give you a more definitive answer. It's really
>> >> >>> >> >> > > >>> about trade-offs.
>> >> >>> >> >> > > >>>
>> >> >>> >> >> > > >>> Cheers,
>> >> >>> >> >> > > >>> Chris
>> >> >>> >> >> > > >>>
>> >> >>> >> >> > > >>> On 8/21/14 12:22 PM, "Shekar Tippur"
>> >><[email protected]>
>> >> >>> >>wrote:
>> >> >>> >> >> > > >>>
>> >> >>> >> >> > > >>> >Chris,
>> >> >>> >> >> > > >>> >
>> >> >>> >> >> > > >>> >A big thanks for a swift response. The data set is huge and the
>> >> >>> >> >> > > >>> >frequency is in bursts.
>> >> >>> >> >> > > >>> >What do you suggest?
>> >> >>> >> >> > > >>> >
>> >> >>> >> >> > > >>> >- Shekar
>> >> >>> >> >> > > >>> >
>> >> >>> >> >> > > >>> >
>> >> >>> >> >> > > >>> >On Thu, Aug 21, 2014 at 11:05 AM, Chris Riccomini
>><
>> >> >>> >> >> > > >>> >[email protected]> wrote:
>> >> >>> >> >> > > >>> >
>> >> >>> >> >> > > >>> >> Hey Shekar,
>> >> >>> >> >> > > >>> >>
>> >> >>> >> >> > > >>> >> This is feasible, and you are on the right thought process.
>> >> >>> >> >> > > >>> >>
>> >> >>> >> >> > > >>> >> For the sake of discussion, I'm going to pretend that you have a
>> >> >>> >> >> > > >>> >> Kafka topic called "PageViewEvent", which has just the IP address
>> >> >>> >> >> > > >>> >> that was used to view a page. These messages will be logged every
>> >> >>> >> >> > > >>> >> time a page view happens. I'm also going to pretend that you have
>> >> >>> >> >> > > >>> >> some state called "IPGeo" (e.g. the MaxMind data set). In this
>> >> >>> >> >> > > >>> >> example, we'll want to join the long/lat geo information from
>> >> >>> >> >> > > >>> >> IPGeo to the PageViewEvent, and send it to a new topic:
>> >> >>> >> >> > > >>> >> PageViewEventsWithGeo.
>> >> >>> >> >> > > >>> >>
>> >> >>> >> >> > > >>> >> You have several options on how to implement this example.
>> >> >>> >> >> > > >>> >>
>> >> >>> >> >> > > >>> >> 1. If your joining data set (IPGeo) is relatively small and
>> >> >>> >> >> > > >>> >> changes infrequently, you can just pack it up in your jar or .tgz
>> >> >>> >> >> > > >>> >> file, and open it in every StreamTask.
>> >> >>> >> >> > > >>> >> 2. If your data set is small, but changes somewhat frequently,
>> >> >>> >> >> > > >>> >> you can throw the data set on some HTTP/HDFS/S3 server somewhere,
>> >> >>> >> >> > > >>> >> and have your StreamTask refresh it periodically by
>> >> >>> >> >> > > >>> >> re-downloading it.
>> >> >>> >> >> > > >>> >> 3. You can do remote RPC calls for the IPGeo data on every page
>> >> >>> >> >> > > >>> >> view event by querying some remote service or DB (e.g. Cassandra).
>> >> >>> >> >> > > >>> >> 4. You can use Samza's state feature to set your IPGeo data as a
>> >> >>> >> >> > > >>> >> series of messages to a log-compacted Kafka topic
>> >> >>> >> >> > > >>> >> (https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction),
>> >> >>> >> >> > > >>> >> and configure your Samza job to read this topic as a bootstrap
>> >> >>> >> >> > > >>> >> stream
>> >> >>> >> >> > > >>> >> (http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html).
>> >> >>> >> >> > > >>> >>
>> >> >>> >> >> > > >>> >> For (4), you'd have to partition the IPGeo state topic according
>> >> >>> >> >> > > >>> >> to the same key as PageViewEvent. If PageViewEvent were
>> >> >>> >> >> > > >>> >> partitioned by, say, member ID, but you want your IPGeo state
>> >> >>> >> >> > > >>> >> topic to be partitioned by IP address, then you'd have to have an
>> >> >>> >> >> > > >>> >> upstream job that re-partitioned PageViewEvent into some new
>> >> >>> >> >> > > >>> >> topic by IP address. This new topic will have to have the same
>> >> >>> >> >> > > >>> >> number of partitions as the IPGeo state topic (if IPGeo has 8
>> >> >>> >> >> > > >>> >> partitions, then the new PageViewEventRepartitioned topic needs 8
>> >> >>> >> >> > > >>> >> as well). This will cause your PageViewEventRepartitioned topic
>> >> >>> >> >> > > >>> >> and your IPGeo state topic to be aligned such that the StreamTask
>> >> >>> >> >> > > >>> >> that gets page views for IP address X will also have the IPGeo
>> >> >>> >> >> > > >>> >> information for IP address X.
>> >> >>> >> >> > > >>> >>
>> >> >>> >> >> > > >>> >> Which strategy you pick is really up to you. :) (4) is the most
>> >> >>> >> >> > > >>> >> complicated, but also the most flexible, and most operationally
>> >> >>> >> >> > > >>> >> sound. (1) is the easiest if it fits your needs.
>> >> >>> >> >> > > >>> >>
>> >> >>> >> >> > > >>> >> Cheers,
>> >> >>> >> >> > > >>> >> Chris
>> >> >>> >> >> > > >>> >>
>> >> >>> >> >> > > >>> >> On 8/21/14 10:15 AM, "Shekar Tippur"
>> >> >>><[email protected]>
>> >> >>> >> >>wrote:
>> >> >>> >> >> > > >>> >>
>> >> >>> >> >> > > >>> >> >Hello,
>> >> >>> >> >> > > >>> >> >
>> >> >>> >> >> > > >>> >> >I am new to Samza. I have just installed Hello
>> >>Samza
>> >> >>>and
>> >> >>> >> >>got it
>> >> >>> >> >> > > >>> >>working.
>> >> >>> >> >> > > >>> >> >
>> >> >>> >> >> > > >>> >> >Here is the use case for which I am trying to
>>use
>> >> >>>Samza:
>> >> >>> >> >> > > >>> >> >
>> >> >>> >> >> > > >>> >> >
>> >> >>> >> >> > > >>> >> >1. Cache the contextual information which
>>contains
>> >> >>>more
>> >> >>> >> >> > information
>> >> >>> >> >> > > >>> >>about
>> >> >>> >> >> > > >>> >> >the hostname or IP address using
>>Samza/Yarn/Kafka
>> >> >>> >> >> > > >>> >> >2. Collect alert and metric events which
>>contain
>> >> >>>either
>> >> >>> >> >> hostname
>> >> >>> >> >> > > >>>or IP
>> >> >>> >> >> > > >>> >> >address
>> >> >>> >> >> > > >>> >> >3. Append contextual information to the alert
>>and
>> >> >>>metric
>> >> >>> >>and
>> >> >>> >> >> > > >>>insert to
>> >> >>> >> >> > > >>> >>a
>> >> >>> >> >> > > >>> >> >Kafka queue from which other subscribers read
>>off
>> >>of.
>> >> >>> >> >> > > >>> >> >
>> >> >>> >> >> > > >>> >> >Can you please shed some light on
>> >> >>> >> >> > > >>> >> >
>> >> >>> >> >> > > >>> >> >1. Is this feasible?
>> >> >>> >> >> > > >>> >> >2. Am I on the right thought process
>> >> >>> >> >> > > >>> >> >3. How do I start
>> >> >>> >> >> > > >>> >> >
>> >> >>> >> >> > > >>> >> >I now have 1 & 2 of them working disparately. I
>> >>need
>> >> >>>to
>> >> >>> >> >> integrate
>> >> >>> >> >> > > >>> them.
>> >> >>> >> >> > > >>> >> >
>> >> >>> >> >> > > >>> >> >Appreciate any input.
>> >> >>> >> >> > > >>> >> >
>> >> >>> >> >> > > >>> >> >- Shekar
>> >> >>> >> >> > > >>> >>
>> >> >>> >> >> > > >>> >>
>> >> >>> >> >> > > >>>
>> >> >>> >> >> > > >>>
>> >> >>> >> >> > > >>
>> >> >>> >> >> > >
>> >> >>> >> >> > >
>> >> >>> >> >> >
>> >> >>> >> >>
>> >> >>> >>
>> >> >>> >>
>> >> >>>
>> >> >>>
>> >> >>
>> >>
>> >>
>>
>>
