Hey Shekar,

Sorry that you're having such a tough time with this. I'll keep trying to help as best I can.
Do you see slf4j-log4j12 in your job's .tgz file?

Cheers,
Chris

On 9/3/14 2:46 PM, "Shekar Tippur" <[email protected]> wrote:
> Chris,
>
> I have added the dependencies below and I still get the same message.
>
>   <dependency>
>     <groupId>org.slf4j</groupId>
>     <artifactId>slf4j-api</artifactId>
>     <version>1.7.7</version>
>   </dependency>
>
>   <dependency>
>     <groupId>org.slf4j</groupId>
>     <artifactId>slf4j-log4j12</artifactId>
>     <version>1.5.6</version>
>   </dependency>
>
> On Wed, Sep 3, 2014 at 9:01 AM, Chris Riccomini <[email protected]> wrote:
>> Hey Shekar,
>>
>> The SLF4J warning is saying that you don't have an SLF4J binding on
>> your classpath. Try adding slf4j-log4j12 as a runtime dependency of
>> your project.
>>
>> Cheers,
>> Chris
>>
>> On 9/2/14 3:24 PM, "Shekar Tippur" <[email protected]> wrote:
>>> Chris,
>>>
>>> In the current state, I just want Samza to connect to 127.0.0.1.
>>>
>>> I have set YARN_HOME:
>>>
>>> $ echo $YARN_HOME
>>> /home/ctippur/hello-samza/deploy/yarn
>>>
>>> I still don't see anything on the Hadoop console.
>>>
>>> Also, I see this during startup:
>>>
>>> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>> SLF4J: Defaulting to no-operation (NOP) logger implementation
>>> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>>>
>>> - Shekar
>>>
>>> On Tue, Sep 2, 2014 at 2:20 PM, Chris Riccomini <[email protected]> wrote:
>>>> Hey Shekar,
>>>>
>>>> Ah ha. In that case, do you expect your SamzaContainer to try to
>>>> connect to the RM at 127.0.0.1, or do you expect it to try to
>>>> connect to some remote RM?
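One thing worth noting about the dependencies quoted above: slf4j-api 1.7.7 is paired with slf4j-log4j12 1.5.6, and SLF4J generally expects the binding to be on the same version as slf4j-api. A sketch of the aligned pair (the 1.7.7 version is carried over from the thread, not verified against the project):

```xml
<!-- Sketch only: keep the SLF4J binding version in lock-step with slf4j-api. -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-api</artifactId>
  <version>1.7.7</version>
</dependency>
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-log4j12</artifactId>
  <version>1.7.7</version>
  <scope>runtime</scope>
</dependency>
```

If the binding still doesn't make it into the job's .tgz, checking the packaging assembly for an explicit include of slf4j-log4j12 may help.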
>>>> If you expect it to try and connect to a remote RM, it's not doing
>>>> that, perhaps because YARN_HOME isn't set.
>>>>
>>>> If you go to your RM's web interface, how many active nodes do you
>>>> see listed?
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> On 9/2/14 2:17 PM, "Shekar Tippur" <[email protected]> wrote:
>>>>> Chris,
>>>>>
>>>>> I am using a RHEL server to host all the components (YARN, Kafka,
>>>>> Samza). I don't have ACLs open to Wikipedia. I am following:
>>>>>
>>>>> http://samza.incubator.apache.org/learn/tutorials/latest/run-hello-samza-without-internet.html
>>>>>
>>>>> - Shekar
>>>>>
>>>>> On Tue, Sep 2, 2014 at 2:13 PM, Chris Riccomini <[email protected]> wrote:
>>>>>> Hey Shekar,
>>>>>>
>>>>>> Can you try changing that to http://127.0.0.1:8088/cluster and see
>>>>>> if you can connect?
>>>>>>
>>>>>> Cheers,
>>>>>> Chris
>>>>>>
>>>>>> On 9/2/14 1:21 PM, "Shekar Tippur" <[email protected]> wrote:
>>>>>>> Another observation: http://10.132.62.185:8088/cluster shows that
>>>>>>> no applications are running.
>>>>>>>
>>>>>>> - Shekar
>>>>>>>
>>>>>>> On Tue, Sep 2, 2014 at 1:15 PM, Shekar Tippur <[email protected]> wrote:
>>>>>>>> YARN seems to be running:
>>>>>>>>
>>>>>>>> yarn 5462 0.0 2.0 1641296 161404 ? Sl Jun20 95:26 /usr/java/jdk1.6.0_31/bin/java -Dproc_resourcemanager -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=yarn-yarn-resourcemanager-pppdc9prd2y2.log -Dyarn.log.file=yarn-yarn-resourcemanager-pppdc9prd2y2.log -Dyarn.home.dir=/usr/lib/hadoop-yarn -Dhadoop.home.dir=/usr/lib/hadoop-yarn -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/usr/lib/hadoop/lib/native -classpath /etc/hadoop/conf:/etc/hadoop/conf:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-yarn/lib/*:/etc/hadoop/conf/rm-config/log4j.properties org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
>>>>>>>>
>>>>>>>> I can tail the Kafka topic as well:
>>>>>>>>
>>>>>>>> deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-raw
>>>>>>>>
>>>>>>>> On Tue, Sep 2, 2014 at 1:04 PM, Chris Riccomini <[email protected]> wrote:
>>>>>>>>> Hey Shekar,
>>>>>>>>>
>>>>>>>>> It looks like your job is hanging trying to connect to the RM on
>>>>>>>>> your localhost. I thought that you said that your job was running
>>>>>>>>> in local mode. If so, it should be using the LocalJobFactory.
>>>>>>>>> If not, and you intend to run on YARN, is your YARN RM up and
>>>>>>>>> running on localhost?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>>> On 9/2/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote:
>>>>>>>>>> Chris,
>>>>>>>>>>
>>>>>>>>>> $ cat ./deploy/samza/undefined-samza-container-name.log
>>>>>>>>>>
>>>>>>>>>> 2014-09-02 11:17:58 JobRunner [INFO] job factory: org.apache.samza.job.yarn.YarnJobFactory
>>>>>>>>>> 2014-09-02 11:17:59 ClientHelper [INFO] trying to connect to RM 127.0.0.1:8032
>>>>>>>>>> 2014-09-02 11:17:59 NativeCodeLoader [WARN] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>>>>>>>> 2014-09-02 11:17:59 RMProxy [INFO] Connecting to ResourceManager at /127.0.0.1:8032
>>>>>>>>>>
>>>>>>>>>> and log4j:
>>>>>>>>>>
>>>>>>>>>> <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
>>>>>>>>>> <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
>>>>>>>>>>   <appender name="RollingAppender" class="org.apache.log4j.DailyRollingFileAppender">
>>>>>>>>>>     <param name="File" value="${samza.log.dir}/${samza.container.name}.log" />
>>>>>>>>>>     <param name="DatePattern" value="'.'yyyy-MM-dd" />
>>>>>>>>>>     <layout class="org.apache.log4j.PatternLayout">
>>>>>>>>>>       <param name="ConversionPattern" value="%d{yyyy-MM-dd HH:mm:ss} %c{1} [%p] %m%n" />
>>>>>>>>>>     </layout>
>>>>>>>>>>   </appender>
>>>>>>>>>>   <root>
>>>>>>>>>>     <priority value="info" />
>>>>>>>>>>     <appender-ref ref="RollingAppender"/>
>>>>>>>>>>   </root>
>>>>>>>>>> </log4j:configuration>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 2, 2014 at 12:02 PM, Chris Riccomini <[email protected]> wrote:
>>>>>>>>>>> Hey Shekar,
>>>>>>>>>>>
>>>>>>>>>>> Can you attach your log files? I'm wondering if it's a
>>>>>>>>>>> mis-configured log4j.xml (or a missing slf4j-log4j jar), which
>>>>>>>>>>> is leading to nearly empty log files. Also, I'm wondering if the
>>>>>>>>>>> job starts fully. Anything you can attach would be helpful.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Chris
>>>>>>>>>>>
>>>>>>>>>>> On 9/2/14 11:43 AM, "Shekar Tippur" <[email protected]> wrote:
>>>>>>>>>>>> I am running in local mode.
>>>>>>>>>>>> S
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 2, 2014 11:42 AM, "Yan Fang" <[email protected]> wrote:
>>>>>>>>>>>>> Hi Shekar,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are you running the job in local mode or yarn mode? If yarn
>>>>>>>>>>>>> mode, the log is in the YARN container log.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Fang, Yan
>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>> +1 (206) 849-4108
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 2, 2014 at 11:36 AM, Shekar Tippur <[email protected]> wrote:
>>>>>>>>>>>>>> Chris,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Got some time to play around a bit more. I tried to edit
>>>>>>>>>>>>>> samza-wikipedia/src/main/java/samza/examples/wikipedia/task/WikipediaFeedStreamTask.java
>>>>>>>>>>>>>> to add a logger info statement to tap the incoming message.
>>>>>>>>>>>>>> I don't see the messages being printed to the log file.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is this the right place to start?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> public class WikipediaFeedStreamTask implements StreamTask {
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   private static final SystemStream OUTPUT_STREAM =
>>>>>>>>>>>>>>       new SystemStream("kafka", "wikipedia-raw");
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   private static final Logger log =
>>>>>>>>>>>>>>       LoggerFactory.getLogger(WikipediaFeedStreamTask.class);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   @Override
>>>>>>>>>>>>>>   public void process(IncomingMessageEnvelope envelope,
>>>>>>>>>>>>>>       MessageCollector collector, TaskCoordinator coordinator) {
>>>>>>>>>>>>>>     Map<String, Object> outgoingMap =
>>>>>>>>>>>>>>         WikipediaFeedEvent.toMap((WikipediaFeedEvent) envelope.getMessage());
>>>>>>>>>>>>>>     log.info(envelope.getMessage().toString());
>>>>>>>>>>>>>>     collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, outgoingMap));
>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Aug 25, 2014 at 9:01 AM, Chris Riccomini <[email protected]> wrote:
>>>>>>>>>>>>>>> Hey Shekar,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Your thought process is on the right track. It's probably
>>>>>>>>>>>>>>> best to start with hello-samza, and modify it to get what
>>>>>>>>>>>>>>> you want. To start with, you'll want to:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Write a simple StreamTask that just does something silly,
>>>>>>>>>>>>>>> like printing the messages it receives.
>>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>>> Write a configuration for the job that consumes from just
>>>>>>>>>>>>>>> the one stream (alerts from different sources).
>>>>>>>>>>>>>>> 3. Run this to make sure you've got it working.
>>>>>>>>>>>>>>> 4. Now add your table join. This can be either a change-data
>>>>>>>>>>>>>>> capture (CDC) stream, or a remote DB call.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That should get you to a point where you've got your job up
>>>>>>>>>>>>>>> and running. From there, you could create your own Maven
>>>>>>>>>>>>>>> project, and migrate your code over accordingly.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Chris
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 8/24/14 1:42 AM, "Shekar Tippur" <[email protected]> wrote:
>>>>>>>>>>>>>>>> Chris,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have gone through the documentation and decided that the
>>>>>>>>>>>>>>>> option that is most suitable for me is stream-table.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I see the following things:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. Point Samza to a table (database)
>>>>>>>>>>>>>>>> 2. Point Samza to a stream - an alert stream from different sources
>>>>>>>>>>>>>>>> 3. Join on a key like a hostname
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have Hello Samza working. To extend it to what I need, I
>>>>>>>>>>>>>>>> am not sure where to start (more code changes, configuration
>>>>>>>>>>>>>>>> changes, or both)?
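For step 2 in Chris's list above, a minimal job configuration might look like the sketch below. The job name, task class, topic, and hosts are illustrative guesses carried over from the hello-samza example in this thread, not a verified config:

```properties
# Sketch of a hello-samza-style job config; all names are illustrative.
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=alert-feed
task.class=samza.examples.wikipedia.task.WikipediaFeedStreamTask
task.inputs=kafka.wikipedia-raw
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=localhost:2181
systems.kafka.producer.metadata.broker.list=localhost:9092
```

Run via hello-samza's run-job.sh with --config-path pointing at this file, as in the tutorial linked earlier in the thread.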
>>>>>>>>>>>>>>>> I have gone through
>>>>>>>>>>>>>>>> http://samza.incubator.apache.org/learn/documentation/latest/api/overview.html
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is my thought process on the right track? Can you please
>>>>>>>>>>>>>>>> point me in the right direction?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Shekar
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Aug 21, 2014 at 1:08 PM, Shekar Tippur <[email protected]> wrote:
>>>>>>>>>>>>>>>>> Chris,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This is a perfectly good answer. I will start poking more
>>>>>>>>>>>>>>>>> into option #4.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Shekar
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Aug 21, 2014 at 1:05 PM, Chris Riccomini <[email protected]> wrote:
>>>>>>>>>>>>>>>>>> Hey Shekar,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Your two options are really (3) or (4), then. You can
>>>>>>>>>>>>>>>>>> either run some external DB that holds the data set and
>>>>>>>>>>>>>>>>>> query it from a StreamTask, or you can use Samza's state
>>>>>>>>>>>>>>>>>> store feature to push data into a stream that you can
>>>>>>>>>>>>>>>>>> then store in a partitioned key-value store along with
>>>>>>>>>>>>>>>>>> your StreamTasks. There is some documentation about the
>>>>>>>>>>>>>>>>>> state store approach here:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (4) is going to require more up-front effort from you,
>>>>>>>>>>>>>>>>>> since you'll have to understand how Kafka's partitioning
>>>>>>>>>>>>>>>>>> model works, and set up a pipeline to push the updates
>>>>>>>>>>>>>>>>>> for your state. In the long run, I believe it's the
>>>>>>>>>>>>>>>>>> better approach, though.
>>>>>>>>>>>>>>>>>> Local lookups on a key-value store should be faster than
>>>>>>>>>>>>>>>>>> doing remote RPC calls to a DB for every message.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm sorry I can't give you a more definitive answer. It's
>>>>>>>>>>>>>>>>>> really about trade-offs.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>> Chris
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 8/21/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>> Chris,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> A big thanks for a swift response. The data set is huge
>>>>>>>>>>>>>>>>>>> and arrives in bursts. What do you suggest?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Shekar
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Aug 21, 2014 at 11:05 AM, Chris Riccomini <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>> Hey Shekar,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This is feasible, and you are on the right thought process.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> For the sake of discussion, I'm going to pretend that
>>>>>>>>>>>>>>>>>>>> you have a Kafka topic called "PageViewEvent", which
>>>>>>>>>>>>>>>>>>>> has just the IP address that was used to view a page.
>>>>>>>>>>>>>>>>>>>> These messages will be logged every time a page view
>>>>>>>>>>>>>>>>>>>> happens. I'm also going to pretend that you have some
>>>>>>>>>>>>>>>>>>>> state called "IPGeo" (e.g. the MaxMind data set). In
>>>>>>>>>>>>>>>>>>>> this example, we'll want to join the long/lat geo
>>>>>>>>>>>>>>>>>>>> information from IPGeo to the PageViewEvent, and send
>>>>>>>>>>>>>>>>>>>> it to a new topic: PageViewEventsWithGeo.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> You have several options on how to implement this example.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 1. If your joining data set (IPGeo) is relatively small
>>>>>>>>>>>>>>>>>>>> and changes infrequently, you can just pack it up in
>>>>>>>>>>>>>>>>>>>> your jar or .tgz file, and open it in every StreamTask.
>>>>>>>>>>>>>>>>>>>> 2. If your data set is small, but changes somewhat
>>>>>>>>>>>>>>>>>>>> frequently, you can put the data set on some
>>>>>>>>>>>>>>>>>>>> HTTP/HDFS/S3 server somewhere, and have your StreamTask
>>>>>>>>>>>>>>>>>>>> refresh it periodically by re-downloading it.
>>>>>>>>>>>>>>>>>>>> 3. You can do remote RPC calls for the IPGeo data on
>>>>>>>>>>>>>>>>>>>> every page view event by querying some remote service
>>>>>>>>>>>>>>>>>>>> or DB (e.g. Cassandra).
>>>>>>>>>>>>>>>>>>>> 4. You can use Samza's state feature to send your IPGeo
>>>>>>>>>>>>>>>>>>>> data as a series of messages to a log-compacted Kafka
>>>>>>>>>>>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>>> (https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction),
>>>>>>>>>>>>>>>>>>>> and configure your Samza job to read this topic as a
>>>>>>>>>>>>>>>>>>>> bootstrap stream
>>>>>>>>>>>>>>>>>>>> (http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> For (4), you'd have to partition the IPGeo state topic
>>>>>>>>>>>>>>>>>>>> by the same key as PageViewEvent. If PageViewEvent were
>>>>>>>>>>>>>>>>>>>> partitioned by, say, member ID, but you want your IPGeo
>>>>>>>>>>>>>>>>>>>> state topic to be partitioned by IP address, then you'd
>>>>>>>>>>>>>>>>>>>> need an upstream job that re-partitions PageViewEvent
>>>>>>>>>>>>>>>>>>>> into some new topic by IP address. This new topic has
>>>>>>>>>>>>>>>>>>>> to have the same number of partitions as the IPGeo
>>>>>>>>>>>>>>>>>>>> state topic (if IPGeo has 8 partitions, then the new
>>>>>>>>>>>>>>>>>>>> PageViewEventRepartitioned topic needs 8 as well). This
>>>>>>>>>>>>>>>>>>>> aligns your PageViewEventRepartitioned topic and your
>>>>>>>>>>>>>>>>>>>> IPGeo state topic, such that the StreamTask that gets
>>>>>>>>>>>>>>>>>>>> page views for IP address X also has the IPGeo
>>>>>>>>>>>>>>>>>>>> information for IP address X.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Which strategy you pick is really up to you. :) (4) is
>>>>>>>>>>>>>>>>>>>> the most complicated, but also the most flexible and
>>>>>>>>>>>>>>>>>>>> the most operationally sound. (1) is the easiest if it
>>>>>>>>>>>>>>>>>>>> fits your needs.
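The partition-alignment requirement in (4) comes down to both topics hashing the same key over the same partition count. A tiny self-contained sketch of that arithmetic (illustrative only, not the actual Kafka partitioner implementation):

```java
public class PartitionAlignment {
    // Kafka-style hash partitioning sketch: the same key with the same
    // partition count always lands on the same partition number.
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result non-negative even when hashCode() is negative
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        String ip = "10.132.62.185";
        // Both topics configured with 8 partitions: the StreamTask that owns
        // this partition sees both the page view and the IPGeo entry for ip.
        System.out.println(partitionFor(ip, 8) == partitionFor(ip, 8)); // prints "true"
    }
}
```

If the two topics had different partition counts, the same key could land on different partition numbers, which is why the repartitioned topic must match IPGeo's count exactly.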
>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>> Chris
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 8/21/14 10:15 AM, "Shekar Tippur" <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I am new to Samza. I have just installed Hello Samza
>>>>>>>>>>>>>>>>>>>>> and got it working.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Here is the use case for which I am trying to use Samza:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1. Cache the contextual information (which contains
>>>>>>>>>>>>>>>>>>>>> more information about the hostname or IP address)
>>>>>>>>>>>>>>>>>>>>> using Samza/YARN/Kafka.
>>>>>>>>>>>>>>>>>>>>> 2. Collect alert and metric events, which contain
>>>>>>>>>>>>>>>>>>>>> either a hostname or an IP address.
>>>>>>>>>>>>>>>>>>>>> 3. Append contextual information to the alert or
>>>>>>>>>>>>>>>>>>>>> metric and insert it into a Kafka queue from which
>>>>>>>>>>>>>>>>>>>>> other subscribers read.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Can you please shed some light on:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1. Is this feasible?
>>>>>>>>>>>>>>>>>>>>> 2. Am I on the right thought process?
>>>>>>>>>>>>>>>>>>>>> 3. How do I start?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I now have 1 & 2 working disparately. I need to
>>>>>>>>>>>>>>>>>>>>> integrate them.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Appreciate any input.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> - Shekar
