Hey Shekar,

Can you try changing that to:

http://127.0.0.1:8088/cluster

And see if you can connect?

Cheers,
Chris

On 9/2/14 1:21 PM, "Shekar Tippur" <[email protected]> wrote:

> Other observation is ..
>
> http://10.132.62.185:8088/cluster shows that no applications are running.
>
> - Shekar
>
> On Tue, Sep 2, 2014 at 1:15 PM, Shekar Tippur <[email protected]> wrote:
>
>> Yarn seems to be running ..
>>
>> yarn 5462 0.0 2.0 1641296 161404 ? Sl Jun20 95:26
>> /usr/java/jdk1.6.0_31/bin/java -Dproc_resourcemanager -Xmx1000m
>> -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn
>> -Dhadoop.log.file=yarn-yarn-resourcemanager-pppdc9prd2y2.log
>> -Dyarn.log.file=yarn-yarn-resourcemanager-pppdc9prd2y2.log
>> -Dyarn.home.dir=/usr/lib/hadoop-yarn -Dhadoop.home.dir=/usr/lib/hadoop-yarn
>> -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA
>> -Djava.library.path=/usr/lib/hadoop/lib/native -classpath
>> /etc/hadoop/conf:/etc/hadoop/conf:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-yarn/lib/*:/etc/hadoop/conf/rm-config/log4j.properties
>> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
>>
>> I can tail the kafka topic as well ..
>>
>> deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-raw
>>
>> On Tue, Sep 2, 2014 at 1:04 PM, Chris Riccomini <[email protected]> wrote:
>>
>>> Hey Shekar,
>>>
>>> It looks like your job is hanging trying to connect to the RM on your
>>> localhost. I thought that you said that your job was running in local
>>> mode. If so, it should be using the LocalJobFactory. If not, and you
>>> intend to run on YARN, is your YARN RM up and running on localhost?
>>>
>>> Cheers,
>>> Chris
>>>
>>> On 9/2/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote:
>>>
>>>> Chris ..
>>>>
>>>> $ cat ./deploy/samza/undefined-samza-container-name.log
>>>>
>>>> 2014-09-02 11:17:58 JobRunner [INFO] job factory: org.apache.samza.job.yarn.YarnJobFactory
>>>> 2014-09-02 11:17:59 ClientHelper [INFO] trying to connect to RM 127.0.0.1:8032
>>>> 2014-09-02 11:17:59 NativeCodeLoader [WARN] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>> 2014-09-02 11:17:59 RMProxy [INFO] Connecting to ResourceManager at /127.0.0.1:8032
>>>>
>>>> and log4j ..
>>>>
>>>> <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
>>>> <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
>>>>   <appender name="RollingAppender" class="org.apache.log4j.DailyRollingFileAppender">
>>>>     <param name="File" value="${samza.log.dir}/${samza.container.name}.log" />
>>>>     <param name="DatePattern" value="'.'yyyy-MM-dd" />
>>>>     <layout class="org.apache.log4j.PatternLayout">
>>>>       <param name="ConversionPattern" value="%d{yyyy-MM-dd HH:mm:ss} %c{1} [%p] %m%n" />
>>>>     </layout>
>>>>   </appender>
>>>>   <root>
>>>>     <priority value="info" />
>>>>     <appender-ref ref="RollingAppender"/>
>>>>   </root>
>>>> </log4j:configuration>
>>>>
>>>> On Tue, Sep 2, 2014 at 12:02 PM, Chris Riccomini <[email protected]> wrote:
>>>>
>>>>> Hey Shekar,
>>>>>
>>>>> Can you attach your log files? I'm wondering if it's a mis-configured
>>>>> log4j.xml (or a missing slf4j-log4j jar), which is leading to nearly
>>>>> empty log files. Also, I'm wondering if the job starts fully. Anything
>>>>> you can attach would be helpful.
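The JobRunner line in the log above shows the job picking org.apache.samza.job.yarn.YarnJobFactory, which is why it tries to reach an RM at 127.0.0.1:8032. Which factory runs the job is set by job.factory.class in the job's properties file. A minimal sketch, assuming the hello-samza 0.7 layout; the job name and the commented-out package path are illustrative placeholders, not values from this thread:

```properties
# Run in-process without YARN; no ResourceManager needed.
job.factory.class=org.apache.samza.job.local.LocalJobFactory

# To run on YARN instead, use the YARN factory and point
# yarn.package.path at your job package (path is a placeholder):
# job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
# yarn.package.path=file:///path/to/your-job-package.tar.gz

job.name=wikipedia-feed
task.class=samza.examples.wikipedia.task.WikipediaFeedStreamTask
```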
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>> On 9/2/14 11:43 AM, "Shekar Tippur" <[email protected]> wrote:
>>>>>
>>>>>> I am running in local mode.
>>>>>>
>>>>>> S
>>>>>>
>>>>>> On Sep 2, 2014 11:42 AM, "Yan Fang" <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Shekar,
>>>>>>>
>>>>>>> Are you running the job in local mode or yarn mode? If yarn mode, the
>>>>>>> log is in the yarn container log.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Fang, Yan
>>>>>>> [email protected]
>>>>>>> +1 (206) 849-4108
>>>>>>>
>>>>>>> On Tue, Sep 2, 2014 at 11:36 AM, Shekar Tippur <[email protected]> wrote:
>>>>>>>
>>>>>>>> Chris,
>>>>>>>>
>>>>>>>> Got some time to play around a bit more.
>>>>>>>> I tried to edit
>>>>>>>> samza-wikipedia/src/main/java/samza/examples/wikipedia/task/WikipediaFeedStreamTask.java
>>>>>>>> to add a logger info statement to tap the incoming message. I don't
>>>>>>>> see the messages being printed to the log file.
>>>>>>>>
>>>>>>>> Is this the right place to start?
>>>>>>>>
>>>>>>>> public class WikipediaFeedStreamTask implements StreamTask {
>>>>>>>>
>>>>>>>>   private static final SystemStream OUTPUT_STREAM = new SystemStream("kafka", "wikipedia-raw");
>>>>>>>>
>>>>>>>>   private static final Logger log = LoggerFactory.getLogger(WikipediaFeedStreamTask.class);
>>>>>>>>
>>>>>>>>   @Override
>>>>>>>>   public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
>>>>>>>>     Map<String, Object> outgoingMap = WikipediaFeedEvent.toMap((WikipediaFeedEvent) envelope.getMessage());
>>>>>>>>     log.info(envelope.getMessage().toString());
>>>>>>>>     collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, outgoingMap));
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> On Mon, Aug 25, 2014 at 9:01 AM, Chris Riccomini <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey Shekar,
>>>>>>>>>
>>>>>>>>> Your thought process is on the right track. It's probably best to
>>>>>>>>> start with hello-samza, and modify it to get what you want. To start
>>>>>>>>> with, you'll want to:
>>>>>>>>>
>>>>>>>>> 1. Write a simple StreamTask that just does something silly like
>>>>>>>>>    just print messages that it receives.
>>>>>>>>> 2. Write a configuration for the job that consumes from just the
>>>>>>>>>    stream (alerts from different sources).
>>>>>>>>> 3. Run this to make sure you've got it working.
>>>>>>>>> 4. Now add your table join. This can be either a change-data capture
>>>>>>>>>    (CDC) stream, or via a remote DB call.
>>>>>>>>>
>>>>>>>>> That should get you to a point where you've got your job up and running.
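Step 2 in the list above (a configuration that consumes from just the alert stream) can be sketched as a properties file. The system, topic, and task class names here (alerts-raw, AlertPrinterStreamTask) are illustrative placeholders, and the property keys follow the Samza 0.7 configuration documentation, so verify against your version:

```properties
job.factory.class=org.apache.samza.job.local.LocalJobFactory
job.name=alert-printer

task.class=samza.examples.alerts.AlertPrinterStreamTask
task.inputs=kafka.alerts-raw

systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=localhost:2181
systems.kafka.producer.metadata.broker.list=localhost:9092

serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
systems.kafka.samza.msg.serde=json
```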
>>>>>>>>> From there, you could create your own Maven project, and migrate
>>>>>>>>> your code over accordingly.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>>> On 8/24/14 1:42 AM, "Shekar Tippur" <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Chris,
>>>>>>>>>>
>>>>>>>>>> I have gone through the documentation and decided that the option
>>>>>>>>>> that is most suitable for me is stream-table.
>>>>>>>>>>
>>>>>>>>>> I see the following things:
>>>>>>>>>>
>>>>>>>>>> 1. Point Samza to a table (database)
>>>>>>>>>> 2. Point Samza to a stream - Alert stream from different sources
>>>>>>>>>> 3. Join key like a hostname
>>>>>>>>>>
>>>>>>>>>> I have Hello Samza working. To extend that to do what my needs are,
>>>>>>>>>> I am not sure where to start (Does it need more code change OR
>>>>>>>>>> configuration changes OR both)?
>>>>>>>>>>
>>>>>>>>>> I have gone through
>>>>>>>>>> http://samza.incubator.apache.org/learn/documentation/latest/api/overview.html
>>>>>>>>>>
>>>>>>>>>> Is my thought process on the right track? Can you please point me
>>>>>>>>>> in the right direction?
>>>>>>>>>>
>>>>>>>>>> - Shekar
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 21, 2014 at 1:08 PM, Shekar Tippur <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Chris,
>>>>>>>>>>>
>>>>>>>>>>> This is a perfectly good answer. I will start poking more into
>>>>>>>>>>> option #4.
>>>>>>>>>>>
>>>>>>>>>>> - Shekar
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 21, 2014 at 1:05 PM, Chris Riccomini <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Shekar,
>>>>>>>>>>>>
>>>>>>>>>>>> Your two options are really (3) or (4), then. You can either run
>>>>>>>>>>>> some external DB that holds the data set, and query it from a
>>>>>>>>>>>> StreamTask, or you can use Samza's state store feature to push
>>>>>>>>>>>> data into a stream that you can then store in a partitioned
>>>>>>>>>>>> key-value store along with your StreamTasks. There is some
>>>>>>>>>>>> documentation here about the state store approach:
>>>>>>>>>>>>
>>>>>>>>>>>> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html
>>>>>>>>>>>>
>>>>>>>>>>>> (4) is going to require more up-front effort from you, since
>>>>>>>>>>>> you'll have to understand how Kafka's partitioning model works,
>>>>>>>>>>>> and set up some pipeline to push the updates for your state. In
>>>>>>>>>>>> the long run, I believe it's the better approach, though. Local
>>>>>>>>>>>> lookups on a key-value store should be faster than doing remote
>>>>>>>>>>>> RPC calls to a DB for every message.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm sorry I can't give you a more definitive answer.
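The stream-table join behind options (3) and (4) can be illustrated with a small self-contained sketch. The partitioned key-value store is modeled here by a plain HashMap purely for illustration; in a real Samza job the store would come from the task context, and all names in this class (AlertEnricher, putContext, enrich) are hypothetical, not part of any Samza API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the join logic: contextual data keyed by hostname sits in a
// local store (a HashMap stands in for Samza's partitioned key-value
// store), and each incoming alert is enriched before being re-emitted.
public class AlertEnricher {
    private final Map<String, String> contextStore = new HashMap<>();

    // In option (4) this would be fed by the CDC/state stream; in
    // option (3) you would query a remote DB instead of a local map.
    public void putContext(String hostname, String context) {
        contextStore.put(hostname, context);
    }

    // Returns the alert with its context appended, or unchanged when no
    // context is known for that host.
    public String enrich(String hostname, String alert) {
        String ctx = contextStore.get(hostname);
        return ctx == null ? alert : alert + " [context: " + ctx + "]";
    }

    public static void main(String[] args) {
        AlertEnricher e = new AlertEnricher();
        e.putContext("web01", "rack=r1,dc=sfo");
        System.out.println(e.enrich("web01", "disk full"));
        System.out.println(e.enrich("db09", "cpu high"));
    }
}
```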
>>>>>>>>>>>> It's really about trade-offs.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Chris
>>>>>>>>>>>>
>>>>>>>>>>>> On 8/21/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Chris,
>>>>>>>>>>>>>
>>>>>>>>>>>>> A big thanks for a swift response. The data set is huge and the
>>>>>>>>>>>>> frequency is in bursts.
>>>>>>>>>>>>> What do you suggest?
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Shekar
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 21, 2014 at 11:05 AM, Chris Riccomini <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Shekar,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is feasible, and you are on the right thought process.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For the sake of discussion, I'm going to pretend that you have
>>>>>>>>>>>>>> a Kafka topic called "PageViewEvent", which has just the IP
>>>>>>>>>>>>>> address that was used to view a page. These messages will be
>>>>>>>>>>>>>> logged every time a page view happens. I'm also going to
>>>>>>>>>>>>>> pretend that you have some state called "IPGeo" (e.g. the
>>>>>>>>>>>>>> MaxMind data set). In this example, we'll want to join the
>>>>>>>>>>>>>> long/lat geo information from IPGeo to the PageViewEvent, and
>>>>>>>>>>>>>> send it to a new topic: PageViewEventsWithGeo.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You have several options on how to implement this example.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. If your joining data set (IPGeo) is relatively small and
>>>>>>>>>>>>>>    changes infrequently, you can just pack it up in your jar or
>>>>>>>>>>>>>>    .tgz file, and open it in every StreamTask.
>>>>>>>>>>>>>> 2. If your data set is small, but changes somewhat frequently,
>>>>>>>>>>>>>>    you can throw the data set on some HTTP/HDFS/S3 server
>>>>>>>>>>>>>>    somewhere, and have your StreamTask refresh it periodically
>>>>>>>>>>>>>>    by re-downloading it.
>>>>>>>>>>>>>> 3. You can do remote RPC calls for the IPGeo data on every page
>>>>>>>>>>>>>>    view event by querying some remote service or DB (e.g. Cassandra).
>>>>>>>>>>>>>> 4. You can use Samza's state feature to send your IPGeo data as
>>>>>>>>>>>>>>    a series of messages to a log-compacted Kafka topic
>>>>>>>>>>>>>>    (https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction),
>>>>>>>>>>>>>>    and configure your Samza job to read this topic as a bootstrap stream
>>>>>>>>>>>>>>    (http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For (4), you'd have to partition the IPGeo state topic
>>>>>>>>>>>>>> according to the same key as PageViewEvent.
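The bootstrap-stream half of option (4) is configured per stream. A sketch, assuming a Kafka system named "kafka" and a state topic named "ipgeo" (both placeholder names); the property keys are as I understand the Samza 0.7 streams documentation, so verify them against your version before relying on this:

```properties
# Treat ipgeo as a bootstrap stream: read it fully from the oldest
# offset before processing other inputs, then keep consuming updates.
systems.kafka.streams.ipgeo.samza.bootstrap=true
systems.kafka.streams.ipgeo.samza.offset.default=oldest
systems.kafka.streams.ipgeo.samza.reset.offset=true
```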
>>>>>>>>>>>>>> If PageViewEvent were partitioned by, say, member ID, but you
>>>>>>>>>>>>>> want your IPGeo state topic to be partitioned by IP address,
>>>>>>>>>>>>>> then you'd have to have an upstream job that re-partitioned
>>>>>>>>>>>>>> PageViewEvent into some new topic by IP address. This new topic
>>>>>>>>>>>>>> will have to have the same number of partitions as the IPGeo
>>>>>>>>>>>>>> state topic (if IPGeo has 8 partitions, then the new
>>>>>>>>>>>>>> PageViewEventRepartitioned topic needs 8 as well). This will
>>>>>>>>>>>>>> cause your PageViewEventRepartitioned topic and your IPGeo
>>>>>>>>>>>>>> state topic to be aligned such that the StreamTask that gets
>>>>>>>>>>>>>> page views for IP address X will also have the IPGeo
>>>>>>>>>>>>>> information for IP address X.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Which strategy you pick is really up to you. :) (4) is the most
>>>>>>>>>>>>>> complicated, but also the most flexible, and most operationally
>>>>>>>>>>>>>> sound. (1) is the easiest if it fits your needs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Chris
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 8/21/14 10:15 AM, "Shekar Tippur" <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am new to Samza. I have just installed Hello Samza and got
>>>>>>>>>>>>>>> it working.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here is the use case for which I am trying to use Samza:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Cache the contextual information which contains more
>>>>>>>>>>>>>>>    information about the hostname or IP address using
>>>>>>>>>>>>>>>    Samza/Yarn/Kafka
>>>>>>>>>>>>>>> 2. Collect alert and metric events which contain either
>>>>>>>>>>>>>>>    hostname or IP address
>>>>>>>>>>>>>>> 3. Append contextual information to the alert and metric, and
>>>>>>>>>>>>>>>    insert into a Kafka queue from which other subscribers read
>>>>>>>>>>>>>>>    off of.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you please shed some light on
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Is this feasible?
>>>>>>>>>>>>>>> 2. Am I on the right thought process?
>>>>>>>>>>>>>>> 3. How do I start?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I now have 1 & 2 of them working disparately. I need to
>>>>>>>>>>>>>>> integrate them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Appreciate any input.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Shekar
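The partition-alignment requirement Chris describes earlier in the thread (same key, same partition count in both topics) can be shown with a small self-contained example. The hash-mod function below is illustrative only; it is not Kafka's actual default partitioner:

```java
// Demonstrates why PageViewEventRepartitioned and the IPGeo state topic
// must have the same partition count: under an identical hash-mod
// scheme, the same IP address lands on the same partition number in
// both topics, so one StreamTask sees the page view and its geo data.
public class PartitionAlignment {
    // Illustrative hash-mod partitioner, not Kafka's real one.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        String ip = "10.132.62.185";
        int pageViewPartition = partitionFor(ip, 8); // repartitioned topic, 8 partitions
        int ipGeoPartition    = partitionFor(ip, 8); // IPGeo state topic, 8 partitions
        System.out.println(pageViewPartition == ipGeoPartition);

        // With mismatched partition counts the alignment breaks for most keys:
        System.out.println(partitionFor(ip, 8) + " vs " + partitionFor(ip, 6));
    }
}
```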
