Hey Shekar,

Can you try changing that to:

http://127.0.0.1:8088/cluster

And see if you can connect?

Cheers,
Chris

On 9/2/14 1:21 PM, "Shekar Tippur" <[email protected]> wrote:

> Other observation is ..
>
> http://10.132.62.185:8088/cluster shows that no applications are running.
>
> - Shekar
>
> On Tue, Sep 2, 2014 at 1:15 PM, Shekar Tippur <[email protected]> wrote:
>
>> Yarn seems to be running ..
>>
>> yarn 5462 0.0 2.0 1641296 161404 ? Sl Jun20 95:26
>> /usr/java/jdk1.6.0_31/bin/java -Dproc_resourcemanager -Xmx1000m
>> -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn
>> -Dhadoop.log.file=yarn-yarn-resourcemanager-pppdc9prd2y2.log
>> -Dyarn.log.file=yarn-yarn-resourcemanager-pppdc9prd2y2.log
>> -Dyarn.home.dir=/usr/lib/hadoop-yarn -Dhadoop.home.dir=/usr/lib/hadoop-yarn
>> -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA
>> -Djava.library.path=/usr/lib/hadoop/lib/native -classpath
>> /etc/hadoop/conf:/etc/hadoop/conf:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-yarn/lib/*:/etc/hadoop/conf/rm-config/log4j.properties
>> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
>>
>> I can tail the kafka topic as well ..
>>
>> deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-raw
>>
>> On Tue, Sep 2, 2014 at 1:04 PM, Chris Riccomini <[email protected]> wrote:
>>
>>> Hey Shekar,
>>>
>>> It looks like your job is hanging trying to connect to the RM on your
>>> localhost. I thought that you said that your job was running in local
>>> mode. If so, it should be using the LocalJobFactory. If not, and you
>>> intend to run on YARN, is your YARN RM up and running on localhost?
>>>
>>> Cheers,
>>> Chris
>>>
>>> On 9/2/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote:
>>>
>>>> Chris ..
>>>>
>>>> $ cat ./deploy/samza/undefined-samza-container-name.log
>>>>
>>>> 2014-09-02 11:17:58 JobRunner [INFO] job factory: org.apache.samza.job.yarn.YarnJobFactory
>>>> 2014-09-02 11:17:59 ClientHelper [INFO] trying to connect to RM 127.0.0.1:8032
>>>> 2014-09-02 11:17:59 NativeCodeLoader [WARN] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>> 2014-09-02 11:17:59 RMProxy [INFO] Connecting to ResourceManager at /127.0.0.1:8032
>>>>
>>>> and log4j ..
>>>>
>>>> <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
>>>> <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
>>>>   <appender name="RollingAppender" class="org.apache.log4j.DailyRollingFileAppender">
>>>>     <param name="File" value="${samza.log.dir}/${samza.container.name}.log" />
>>>>     <param name="DatePattern" value="'.'yyyy-MM-dd" />
>>>>     <layout class="org.apache.log4j.PatternLayout">
>>>>       <param name="ConversionPattern" value="%d{yyyy-MM-dd HH:mm:ss} %c{1} [%p] %m%n" />
>>>>     </layout>
>>>>   </appender>
>>>>   <root>
>>>>     <priority value="info" />
>>>>     <appender-ref ref="RollingAppender"/>
>>>>   </root>
>>>> </log4j:configuration>
>>>>
>>>> On Tue, Sep 2, 2014 at 12:02 PM, Chris Riccomini <[email protected]> wrote:
>>>>
>>>>> Hey Shekar,
>>>>>
>>>>> Can you attach your log files? I'm wondering if it's a mis-configured
>>>>> log4j.xml (or a missing slf4j-log4j jar), which is leading to nearly
>>>>> empty log files. Also, I'm wondering if the job starts fully. Anything
>>>>> you can attach would be helpful.
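The JobRunner line in the log above shows the job picking org.apache.samza.job.yarn.YarnJobFactory, which is why it tries to reach an RM at 127.0.0.1:8032. Which factory runs the job is set by job.factory.class in the job's properties file. A minimal sketch, assuming the hello-samza 0.7 layout; the job name and the commented-out package path are illustrative placeholders, not values from this thread:

```properties
# Run in-process without YARN; no ResourceManager needed.
job.factory.class=org.apache.samza.job.local.LocalJobFactory

# To run on YARN instead, use the YARN factory and point
# yarn.package.path at your job package (path is a placeholder):
# job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
# yarn.package.path=file:///path/to/your-job-package.tar.gz

job.name=wikipedia-feed
task.class=samza.examples.wikipedia.task.WikipediaFeedStreamTask
```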
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>> On 9/2/14 11:43 AM, "Shekar Tippur" <[email protected]> wrote:
>>>>>
>>>>>> I am running in local mode.
>>>>>>
>>>>>> S
>>>>>>
>>>>>> On Sep 2, 2014 11:42 AM, "Yan Fang" <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Shekar,
>>>>>>>
>>>>>>> Are you running the job in local mode or yarn mode? If yarn mode, the
>>>>>>> log is in the yarn container log.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Fang, Yan
>>>>>>> [email protected]
>>>>>>> +1 (206) 849-4108
>>>>>>>
>>>>>>> On Tue, Sep 2, 2014 at 11:36 AM, Shekar Tippur <[email protected]> wrote:
>>>>>>>
>>>>>>>> Chris,
>>>>>>>>
>>>>>>>> Got some time to play around a bit more.
>>>>>>>> I tried to edit
>>>>>>>> samza-wikipedia/src/main/java/samza/examples/wikipedia/task/WikipediaFeedStreamTask.java
>>>>>>>> to add a logger info statement to tap the incoming message. I don't
>>>>>>>> see the messages being printed to the log file.
>>>>>>>>
>>>>>>>> Is this the right place to start?
>>>>>>>>
>>>>>>>> public class WikipediaFeedStreamTask implements StreamTask {
>>>>>>>>
>>>>>>>>   private static final SystemStream OUTPUT_STREAM = new SystemStream("kafka", "wikipedia-raw");
>>>>>>>>
>>>>>>>>   private static final Logger log = LoggerFactory.getLogger(WikipediaFeedStreamTask.class);
>>>>>>>>
>>>>>>>>   @Override
>>>>>>>>   public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
>>>>>>>>     Map<String, Object> outgoingMap = WikipediaFeedEvent.toMap((WikipediaFeedEvent) envelope.getMessage());
>>>>>>>>     log.info(envelope.getMessage().toString());
>>>>>>>>     collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, outgoingMap));
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> On Mon, Aug 25, 2014 at 9:01 AM, Chris Riccomini <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey Shekar,
>>>>>>>>>
>>>>>>>>> Your thought process is on the right track. It's probably best to
>>>>>>>>> start with hello-samza, and modify it to get what you want. To start
>>>>>>>>> with, you'll want to:
>>>>>>>>>
>>>>>>>>> 1. Write a simple StreamTask that just does something silly like
>>>>>>>>>    just print messages that it receives.
>>>>>>>>> 2. Write a configuration for the job that consumes from just the
>>>>>>>>>    stream (alerts from different sources).
>>>>>>>>> 3. Run this to make sure you've got it working.
>>>>>>>>> 4. Now add your table join. This can be either a change-data capture
>>>>>>>>>    (CDC) stream, or via a remote DB call.
>>>>>>>>>
>>>>>>>>> That should get you to a point where you've got your job up and running.
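Step 2 in the list above (a configuration that consumes from just the alert stream) can be sketched as a properties file. The system, topic, and task class names here (alerts-raw, AlertPrinterStreamTask) are illustrative placeholders, and the property keys follow the Samza 0.7 configuration documentation, so verify against your version:

```properties
job.factory.class=org.apache.samza.job.local.LocalJobFactory
job.name=alert-printer

task.class=samza.examples.alerts.AlertPrinterStreamTask
task.inputs=kafka.alerts-raw

systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=localhost:2181
systems.kafka.producer.metadata.broker.list=localhost:9092

serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
systems.kafka.samza.msg.serde=json
```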
>>>>>>>>> From there, you could create your own Maven project, and migrate
>>>>>>>>> your code over accordingly.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>>> On 8/24/14 1:42 AM, "Shekar Tippur" <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Chris,
>>>>>>>>>>
>>>>>>>>>> I have gone through the documentation and decided that the option
>>>>>>>>>> that is most suitable for me is stream-table.
>>>>>>>>>>
>>>>>>>>>> I see the following things:
>>>>>>>>>>
>>>>>>>>>> 1. Point Samza to a table (database)
>>>>>>>>>> 2. Point Samza to a stream - Alert stream from different sources
>>>>>>>>>> 3. Join key like a hostname
>>>>>>>>>>
>>>>>>>>>> I have Hello Samza working. To extend that to do what my needs are,
>>>>>>>>>> I am not sure where to start (Does it need more code change OR
>>>>>>>>>> configuration changes OR both)?
>>>>>>>>>>
>>>>>>>>>> I have gone through
>>>>>>>>>> http://samza.incubator.apache.org/learn/documentation/latest/api/overview.html
>>>>>>>>>>
>>>>>>>>>> Is my thought process on the right track? Can you please point me
>>>>>>>>>> in the right direction?
>>>>>>>>>>
>>>>>>>>>> - Shekar
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 21, 2014 at 1:08 PM, Shekar Tippur <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Chris,
>>>>>>>>>>>
>>>>>>>>>>> This is a perfectly good answer. I will start poking more into
>>>>>>>>>>> option #4.
>>>>>>>>>>>
>>>>>>>>>>> - Shekar
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 21, 2014 at 1:05 PM, Chris Riccomini <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Shekar,
>>>>>>>>>>>>
>>>>>>>>>>>> Your two options are really (3) or (4), then. You can either run
>>>>>>>>>>>> some external DB that holds the data set, and query it from a
>>>>>>>>>>>> StreamTask, or you can use Samza's state store feature to push
>>>>>>>>>>>> data into a stream that you can then store in a partitioned
>>>>>>>>>>>> key-value store along with your StreamTasks. There is some
>>>>>>>>>>>> documentation here about the state store approach:
>>>>>>>>>>>>
>>>>>>>>>>>> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html
>>>>>>>>>>>>
>>>>>>>>>>>> (4) is going to require more up-front effort from you, since
>>>>>>>>>>>> you'll have to understand how Kafka's partitioning model works,
>>>>>>>>>>>> and set up some pipeline to push the updates for your state. In
>>>>>>>>>>>> the long run, I believe it's the better approach, though. Local
>>>>>>>>>>>> lookups on a key-value store should be faster than doing remote
>>>>>>>>>>>> RPC calls to a DB for every message.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm sorry I can't give you a more definitive answer.
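The stream-table join behind options (3) and (4) can be illustrated with a small self-contained sketch. The partitioned key-value store is modeled here by a plain HashMap purely for illustration; in a real Samza job the store would come from the task context, and all names in this class (AlertEnricher, putContext, enrich) are hypothetical, not part of any Samza API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the join logic: contextual data keyed by hostname sits in a
// local store (a HashMap stands in for Samza's partitioned key-value
// store), and each incoming alert is enriched before being re-emitted.
public class AlertEnricher {
    private final Map<String, String> contextStore = new HashMap<>();

    // In option (4) this would be fed by the CDC/state stream; in
    // option (3) you would query a remote DB instead of a local map.
    public void putContext(String hostname, String context) {
        contextStore.put(hostname, context);
    }

    // Returns the alert with its context appended, or unchanged when no
    // context is known for that host.
    public String enrich(String hostname, String alert) {
        String ctx = contextStore.get(hostname);
        return ctx == null ? alert : alert + " [context: " + ctx + "]";
    }

    public static void main(String[] args) {
        AlertEnricher e = new AlertEnricher();
        e.putContext("web01", "rack=r1,dc=sfo");
        System.out.println(e.enrich("web01", "disk full"));
        System.out.println(e.enrich("db09", "cpu high"));
    }
}
```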
>>>>>>>>>>>> It's really about trade-offs.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Chris
>>>>>>>>>>>>
>>>>>>>>>>>> On 8/21/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Chris,
>>>>>>>>>>>>>
>>>>>>>>>>>>> A big thanks for a swift response. The data set is huge and the
>>>>>>>>>>>>> frequency is in bursts.
>>>>>>>>>>>>> What do you suggest?
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Shekar
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 21, 2014 at 11:05 AM, Chris Riccomini <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Shekar,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is feasible, and you are on the right thought process.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For the sake of discussion, I'm going to pretend that you have
>>>>>>>>>>>>>> a Kafka topic called "PageViewEvent", which has just the IP
>>>>>>>>>>>>>> address that was used to view a page. These messages will be
>>>>>>>>>>>>>> logged every time a page view happens. I'm also going to
>>>>>>>>>>>>>> pretend that you have some state called "IPGeo" (e.g. the
>>>>>>>>>>>>>> MaxMind data set). In this example, we'll want to join the
>>>>>>>>>>>>>> long/lat geo information from IPGeo to the PageViewEvent, and
>>>>>>>>>>>>>> send it to a new topic: PageViewEventsWithGeo.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You have several options on how to implement this example.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. If your joining data set (IPGeo) is relatively small and
>>>>>>>>>>>>>>    changes infrequently, you can just pack it up in your jar or
>>>>>>>>>>>>>>    .tgz file, and open it in every StreamTask.
>>>>>>>>>>>>>> 2. If your data set is small, but changes somewhat frequently,
>>>>>>>>>>>>>>    you can throw the data set on some HTTP/HDFS/S3 server
>>>>>>>>>>>>>>    somewhere, and have your StreamTask refresh it periodically
>>>>>>>>>>>>>>    by re-downloading it.
>>>>>>>>>>>>>> 3. You can do remote RPC calls for the IPGeo data on every page
>>>>>>>>>>>>>>    view event by querying some remote service or DB (e.g. Cassandra).
>>>>>>>>>>>>>> 4. You can use Samza's state feature to send your IPGeo data as
>>>>>>>>>>>>>>    a series of messages to a log-compacted Kafka topic
>>>>>>>>>>>>>>    (https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction),
>>>>>>>>>>>>>>    and configure your Samza job to read this topic as a bootstrap stream
>>>>>>>>>>>>>>    (http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For (4), you'd have to partition the IPGeo state topic
>>>>>>>>>>>>>> according to the same key as PageViewEvent.
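The bootstrap-stream half of option (4) is configured per stream. A sketch, assuming a Kafka system named "kafka" and a state topic named "ipgeo" (both placeholder names); the property keys are as I understand the Samza 0.7 streams documentation, so verify them against your version before relying on this:

```properties
# Treat ipgeo as a bootstrap stream: read it fully from the oldest
# offset before processing other inputs, then keep consuming updates.
systems.kafka.streams.ipgeo.samza.bootstrap=true
systems.kafka.streams.ipgeo.samza.offset.default=oldest
systems.kafka.streams.ipgeo.samza.reset.offset=true
```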
>>>>>>>>>>>>>> If PageViewEvent were partitioned by, say, member ID, but you
>>>>>>>>>>>>>> want your IPGeo state topic to be partitioned by IP address,
>>>>>>>>>>>>>> then you'd have to have an upstream job that re-partitioned
>>>>>>>>>>>>>> PageViewEvent into some new topic by IP address. This new topic
>>>>>>>>>>>>>> will have to have the same number of partitions as the IPGeo
>>>>>>>>>>>>>> state topic (if IPGeo has 8 partitions, then the new
>>>>>>>>>>>>>> PageViewEventRepartitioned topic needs 8 as well). This will
>>>>>>>>>>>>>> cause your PageViewEventRepartitioned topic and your IPGeo
>>>>>>>>>>>>>> state topic to be aligned such that the StreamTask that gets
>>>>>>>>>>>>>> page views for IP address X will also have the IPGeo
>>>>>>>>>>>>>> information for IP address X.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Which strategy you pick is really up to you. :) (4) is the most
>>>>>>>>>>>>>> complicated, but also the most flexible, and most operationally
>>>>>>>>>>>>>> sound. (1) is the easiest if it fits your needs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Chris
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 8/21/14 10:15 AM, "Shekar Tippur" <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am new to Samza. I have just installed Hello Samza and got
>>>>>>>>>>>>>>> it working.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here is the use case for which I am trying to use Samza:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Cache the contextual information which contains more
>>>>>>>>>>>>>>>    information about the hostname or IP address using
>>>>>>>>>>>>>>>    Samza/Yarn/Kafka
>>>>>>>>>>>>>>> 2. Collect alert and metric events which contain either
>>>>>>>>>>>>>>>    hostname or IP address
>>>>>>>>>>>>>>> 3. Append contextual information to the alert and metric, and
>>>>>>>>>>>>>>>    insert into a Kafka queue from which other subscribers read
>>>>>>>>>>>>>>>    off of.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you please shed some light on
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Is this feasible?
>>>>>>>>>>>>>>> 2. Am I on the right thought process?
>>>>>>>>>>>>>>> 3. How do I start?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I now have 1 & 2 of them working disparately. I need to
>>>>>>>>>>>>>>> integrate them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Appreciate any input.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Shekar
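The partition-alignment requirement Chris describes earlier in the thread (same key, same partition count in both topics) can be shown with a small self-contained example. The hash-mod function below is illustrative only; it is not Kafka's actual default partitioner:

```java
// Demonstrates why PageViewEventRepartitioned and the IPGeo state topic
// must have the same partition count: under an identical hash-mod
// scheme, the same IP address lands on the same partition number in
// both topics, so one StreamTask sees the page view and its geo data.
public class PartitionAlignment {
    // Illustrative hash-mod partitioner, not Kafka's real one.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        String ip = "10.132.62.185";
        int pageViewPartition = partitionFor(ip, 8); // repartitioned topic, 8 partitions
        int ipGeoPartition    = partitionFor(ip, 8); // IPGeo state topic, 8 partitions
        System.out.println(pageViewPartition == ipGeoPartition);

        // With mismatched partition counts the alignment breaks for most keys:
        System.out.println(partitionFor(ip, 8) + " vs " + partitionFor(ip, 6));
    }
}
```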
