Another observation: http://10.132.62.185:8088/cluster shows that no applications are running.
- Shekar

On Tue, Sep 2, 2014 at 1:15 PM, Shekar Tippur <[email protected]> wrote:

YARN seems to be running:

yarn 5462 0.0 2.0 1641296 161404 ? Sl Jun20 95:26 /usr/java/jdk1.6.0_31/bin/java -Dproc_resourcemanager -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=yarn-yarn-resourcemanager-pppdc9prd2y2.log -Dyarn.log.file=yarn-yarn-resourcemanager-pppdc9prd2y2.log -Dyarn.home.dir=/usr/lib/hadoop-yarn -Dhadoop.home.dir=/usr/lib/hadoop-yarn -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/usr/lib/hadoop/lib/native -classpath /etc/hadoop/conf:/etc/hadoop/conf:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-yarn/lib/*:/etc/hadoop/conf/rm-config/log4j.properties org.apache.hadoop.yarn.server.resourcemanager.ResourceManager

I can tail the Kafka topic as well:

    deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-raw

On Tue, Sep 2, 2014 at 1:04 PM, Chris Riccomini <[email protected]> wrote:

Hey Shekar,

It looks like your job is hanging trying to connect to the RM on your localhost. I thought you said that your job was running in local mode. If so, it should be using the LocalJobFactory. If not, and you intend to run on YARN, is your YARN RM up and running on localhost?

Cheers,
Chris

On 9/2/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote:

Chris ..
$ cat ./deploy/samza/undefined-samza-container-name.log

2014-09-02 11:17:58 JobRunner [INFO] job factory: org.apache.samza.job.yarn.YarnJobFactory
2014-09-02 11:17:59 ClientHelper [INFO] trying to connect to RM 127.0.0.1:8032
2014-09-02 11:17:59 NativeCodeLoader [WARN] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-09-02 11:17:59 RMProxy [INFO] Connecting to ResourceManager at /127.0.0.1:8032

and log4j ..

<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
  <appender name="RollingAppender" class="org.apache.log4j.DailyRollingFileAppender">
    <param name="File" value="${samza.log.dir}/${samza.container.name}.log" />
    <param name="DatePattern" value="'.'yyyy-MM-dd" />
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{yyyy-MM-dd HH:mm:ss} %c{1} [%p] %m%n" />
    </layout>
  </appender>
  <root>
    <priority value="info" />
    <appender-ref ref="RollingAppender"/>
  </root>
</log4j:configuration>

On Tue, Sep 2, 2014 at 12:02 PM, Chris Riccomini <[email protected]> wrote:

Hey Shekar,

Can you attach your log files? I'm wondering if it's a mis-configured log4j.xml (or a missing slf4j-log4j jar), which is leading to nearly empty log files. Also, I'm wondering if the job starts fully. Anything you can attach would be helpful.

Cheers,
Chris

On 9/2/14 11:43 AM, "Shekar Tippur" <[email protected]> wrote:

I am running in local mode.

S

On Sep 2, 2014 11:42 AM, "Yan Fang" <[email protected]> wrote:

Hi Shekar.
Are you running the job in local mode or yarn mode? If yarn mode, the log is in the YARN container log.

Thanks,

Fang, Yan
[email protected]
+1 (206) 849-4108

On Tue, Sep 2, 2014 at 11:36 AM, Shekar Tippur <[email protected]> wrote:

Chris,

Got some time to play around a bit more. I tried to edit samza-wikipedia/src/main/java/samza/examples/wikipedia/task/WikipediaFeedStreamTask.java to add a logger info statement to tap the incoming message. I don't see the messages being printed to the log file.

Is this the right place to start?

    import java.util.Map;

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class WikipediaFeedStreamTask implements StreamTask {

      private static final SystemStream OUTPUT_STREAM = new SystemStream("kafka", "wikipedia-raw");

      private static final Logger log = LoggerFactory.getLogger(WikipediaFeedStreamTask.class);

      @Override
      public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        Map<String, Object> outgoingMap = WikipediaFeedEvent.toMap((WikipediaFeedEvent) envelope.getMessage());
        log.info(envelope.getMessage().toString());
        collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, outgoingMap));
      }
    }

On Mon, Aug 25, 2014 at 9:01 AM, Chris Riccomini <[email protected]> wrote:

Hey Shekar,

Your thought process is on the right track.
It's probably best to start with hello-samza and modify it to get what you want. To start with, you'll want to:

1. Write a simple StreamTask that just does something silly, like print the messages it receives.
2. Write a configuration for the job that consumes from just the stream (alerts from different sources).
3. Run this to make sure you've got it working.
4. Now add your table join. This can be either a change-data capture (CDC) stream, or a remote DB call.

That should get you to a point where you've got your job up and running. From there, you could create your own Maven project and migrate your code over accordingly.

Cheers,
Chris

On 8/24/14 1:42 AM, "Shekar Tippur" <[email protected]> wrote:

Chris,

I have gone through the documentation and decided that the option most suitable for me is stream-table.

I see the following things:

1. Point Samza to a table (database)
2. Point Samza to a stream - alert streams from different sources
3. Join on a key, like a hostname

I have Hello Samza working. To extend it to my needs, I am not sure where to start (more code changes, configuration changes, or both)?

I have gone through http://samza.incubator.apache.org/learn/documentation/latest/api/overview.html.
Is my thought process on the right track? Can you please point me in the right direction?

- Shekar

On Thu, Aug 21, 2014 at 1:08 PM, Shekar Tippur <[email protected]> wrote:

Chris,

This is a perfectly good answer. I will start poking more into option #4.

- Shekar

On Thu, Aug 21, 2014 at 1:05 PM, Chris Riccomini <[email protected]> wrote:

Hey Shekar,

Your two options are really (3) or (4), then. You can either run some external DB that holds the data set and query it from a StreamTask, or you can use Samza's state store feature to push data into a stream that you can then store in a partitioned key-value store along with your StreamTasks.
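As a reference point for the configuration changes discussed in this thread, a minimal hello-samza-style properties file for such a job might look as follows. Every name below is a placeholder, and the key names follow the 0.7-era Samza docs, so verify them against your version:

```properties
# Which factory runs the job: in-process ("local mode") or on YARN.
# Assumed class names from the 0.7-era docs:
job.factory.class=org.apache.samza.job.local.LocalJobFactory
# job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=alert-enricher

# The StreamTask to run and the stream it consumes (placeholders):
task.class=samza.examples.alerts.AlertStreamTask
task.inputs=kafka.alerts-raw

# Kafka system wiring; the producer keys in particular vary by version:
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=localhost:2181
systems.kafka.producer.metadata.broker.list=localhost:9092
```

With YarnJobFactory, the RM address comes from the Hadoop config on the classpath (HADOOP_CONF_DIR), not from Samza itself, which is worth checking when the job tries to connect to 127.0.0.1:8032.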
There is some documentation here about the >> >> >>state >> >> >> > > >>>store >> >> >> > > >>> approach: >> >> >> > > >>> >> >> >> > > >>> >> >> >> > > >>> >> >> >> > > >>> >> >> >> > > >> >> >> >> >> >> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/st >> >> >> > > >>>ate >> >> >> > > >>> -management.html >> >> >> > > >>> >> >> >> > > >>>< >> >> >> > > >> >> >> >> >>http://samza.incubator.apache.org/learn/documentation/0.7.0/container/s >> >> >> > > >>>tate-management.html> >> >> >> > > >>> >> >> >> > > >>> >> >> >> > > >>> (4) is going to require more up front effort from you, since >> >> >>you'll >> >> >> > > >>>have >> >> >> > > >>> to understand how Kafka's partitioning model works, and >> setup >> >> >>some >> >> >> > > >>> pipeline to push the updates for your state. In the long >> >>run, I >> >> >> > believe >> >> >> > > >>> it's the better approach, though. Local lookups on a >> >>key-value >> >> >> store >> >> >> > > >>> should be faster than doing remote RPC calls to a DB for >> >>every >> >> >> > message. >> >> >> > > >>> >> >> >> > > >>> I'm sorry I can't give you a more definitive answer. It's >> >>really >> >> >> > about >> >> >> > > >>> trade-offs. >> >> >> > > >>> >> >> >> > > >>> Cheers, >> >> >> > > >>> Chris >> >> >> > > >>> >> >> >> > > >>> On 8/21/14 12:22 PM, "Shekar Tippur" <[email protected]> >> >>wrote: >> >> >> > > >>> >> >> >> > > >>> >Chris, >> >> >> > > >>> > >> >> >> > > >>> >A big thanks for a swift response. The data set is huge and >> >>the >> >> >> > > >>>frequency >> >> >> > > >>> >is in burst. >> >> >> > > >>> >What do you suggest? >> >> >> > > >>> > >> >> >> > > >>> >- Shekar >> >> >> > > >>> > >> >> >> > > >>> > >> >> >> > > >>> >On Thu, Aug 21, 2014 at 11:05 AM, Chris Riccomini < >> >> >> > > >>> >[email protected]> wrote: >> >> >> > > >>> > >> >> >> > > >>> >> Hey Shekar, >> >> >> > > >>> >> >> >> >> > > >>> >> This is feasible, and you are on the right thought >> >>process. 
For the sake of discussion, I'm going to pretend that you have a Kafka topic called "PageViewEvent", which has just the IP address that was used to view a page. These messages will be logged every time a page view happens. I'm also going to pretend that you have some state called "IPGeo" (e.g. the MaxMind data set). In this example, we'll want to join the long/lat geo information from IPGeo to the PageViewEvent, and send it to a new topic: PageViewEventsWithGeo.

You have several options on how to implement this example.

1. If your joining data set (IPGeo) is relatively small and changes infrequently, you can just pack it up in your jar or .tgz file, and open it in every StreamTask.
2. If your data set is small but changes somewhat frequently, you can throw the data set on some HTTP/HDFS/S3 server somewhere, and have your StreamTask refresh it periodically by re-downloading it.
3. You can do remote RPC calls for the IPGeo data on every page view event by querying some remote service or DB (e.g. Cassandra).
4.
You can use Samza's state feature to send your IPGeo data as a series of messages to a log-compacted Kafka topic (https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction), and configure your Samza job to read this topic as a bootstrap stream (http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html).

For (4), you'd have to partition the IPGeo state topic according to the same key as PageViewEvent. If PageViewEvent were partitioned by, say, member ID, but you want your IPGeo state topic to be partitioned by IP address, then you'd have to have an upstream job that re-partitioned PageViewEvent into some new topic by IP address. This new topic will have to have the same number of partitions as the IPGeo state topic (if IPGeo has 8 partitions, then the new PageViewEventRepartitioned topic needs 8 as well). This will cause your PageViewEventRepartitioned topic and your IPGeo state topic to be aligned, such that the StreamTask that gets page views for IP address X will also have the IPGeo information for IP address X.
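The alignment property described above can be seen with a toy partitioner. This is a simplification (Kafka's real default partitioner differs), meant only to show why the two topics need the same key and the same partition count:

```java
public class PartitionAlignment {
    // Simplified stand-in for a keyed partitioner: same key + same partition
    // count => same partition index, so the StreamTask that owns partition i
    // of PageViewEventRepartitioned also owns partition i of the IPGeo topic.
    static int partition(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        String ip = "10.132.62.185";
        int p1 = partition(ip, 8); // PageViewEventRepartitioned, 8 partitions
        int p2 = partition(ip, 8); // IPGeo state topic, also 8 partitions
        System.out.println(p1 == p2); // true
    }
}
```

With 8 partitions on one topic and, say, 4 on the other, the indices would diverge for most keys, and the co-located-lookup guarantee would be lost.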
Which strategy you pick is really up to you. :) (4) is the most complicated, but also the most flexible and the most operationally sound. (1) is the easiest if it fits your needs.

Cheers,
Chris

On 8/21/14 10:15 AM, "Shekar Tippur" <[email protected]> wrote:

Hello,

I am new to Samza. I have just installed Hello Samza and got it working.

Here is the use case for which I am trying to use Samza:

1. Cache the contextual information (which contains more detail about the hostname or IP address) using Samza/YARN/Kafka.
2. Collect alert and metric events, which contain either a hostname or an IP address.
3. Append the contextual information to each alert or metric and insert the result into a Kafka queue that other subscribers read off of.

Can you please shed some light on:

1. Is this feasible?
2. Am I on the right thought process?
3. How do I start?

I now have 1 & 2 working disparately. I need to integrate them.

Appreciate any input.
- Shekar
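To tie the thread together: the enrichment Shekar describes (and option (1) in Chris's list) reduces to a keyed lookup against cached context. A self-contained sketch, with the data set and field values invented for illustration and no Samza API involved:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the stream-table join: context is cached in memory (in a real
// job it would come from a bundled file, a remote DB, or a Samza key-value
// store), and each alert is enriched by a lookup on its host/IP key.
public class AlertEnricher {
    private static final Map<String, String> CONTEXT = new HashMap<>();
    static {
        // Invented contextual record for one host:
        CONTEXT.put("10.132.62.185", "dc=west;role=resourcemanager");
    }

    // Append context to an alert keyed by IP; unknown hosts pass through.
    public static String enrich(String ip, String alert) {
        return alert + "|" + CONTEXT.getOrDefault(ip, "context=unknown");
    }

    public static void main(String[] args) {
        // Prints: disk-full|dc=west;role=resourcemanager
        System.out.println(enrich("10.132.62.185", "disk-full"));
    }
}
```

In a Samza job the `enrich` call would sit inside `process()`, and the enriched string would go out via the collector to the downstream Kafka topic.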
