Chris, I am using a RHEL server to host all the components (YARN, Kafka, Samza). I don't have ACLs open to Wikipedia. I am following http://samza.incubator.apache.org/learn/tutorials/latest/run-hello-samza-without-internet.html
- Shekar

On Tue, Sep 2, 2014 at 2:13 PM, Chris Riccomini <[email protected]> wrote:

Hey Shekar,

Can you try changing that to:

http://127.0.0.1:8088/cluster

and see if you can connect?

Cheers,
Chris

On 9/2/14 1:21 PM, "Shekar Tippur" <[email protected]> wrote:

Another observation: http://10.132.62.185:8088/cluster shows that no applications are running.

- Shekar

On Tue, Sep 2, 2014 at 1:15 PM, Shekar Tippur <[email protected]> wrote:

YARN seems to be running:

yarn 5462 0.0 2.0 1641296 161404 ? Sl Jun20 95:26 /usr/java/jdk1.6.0_31/bin/java -Dproc_resourcemanager -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=yarn-yarn-resourcemanager-pppdc9prd2y2.log -Dyarn.log.file=yarn-yarn-resourcemanager-pppdc9prd2y2.log -Dyarn.home.dir=/usr/lib/hadoop-yarn -Dhadoop.home.dir=/usr/lib/hadoop-yarn -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/usr/lib/hadoop/lib/native -classpath /etc/hadoop/conf:/etc/hadoop/conf:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-yarn/lib/*:/etc/hadoop/conf/rm-config/log4j.properties org.apache.hadoop.yarn.server.resourcemanager.ResourceManager

I can tail the Kafka topic as well:

deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-raw

On Tue, Sep 2, 2014 at 1:04 PM, Chris Riccomini <[email protected]> wrote:

Hey Shekar,

It looks like your job is hanging trying to connect to the RM on your localhost.
I thought you said that your job was running in local mode. If so, it should be using the LocalJobFactory. If not, and you intend to run on YARN, is your YARN RM up and running on localhost?

Cheers,
Chris

On 9/2/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote:

Chris,

$ cat ./deploy/samza/undefined-samza-container-name.log

2014-09-02 11:17:58 JobRunner [INFO] job factory: org.apache.samza.job.yarn.YarnJobFactory
2014-09-02 11:17:59 ClientHelper [INFO] trying to connect to RM 127.0.0.1:8032
2014-09-02 11:17:59 NativeCodeLoader [WARN] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-09-02 11:17:59 RMProxy [INFO] Connecting to ResourceManager at /127.0.0.1:8032

and log4j:

<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
  <appender name="RollingAppender" class="org.apache.log4j.DailyRollingFileAppender">
    <param name="File" value="${samza.log.dir}/${samza.container.name}.log" />
    <param name="DatePattern" value="'.'yyyy-MM-dd" />
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{yyyy-MM-dd HH:mm:ss} %c{1} [%p] %m%n" />
    </layout>
  </appender>
  <root>
    <priority value="info" />
    <appender-ref ref="RollingAppender"/>
  </root>
</log4j:configuration>

On Tue, Sep 2, 2014 at 12:02 PM, Chris Riccomini <[email protected]> wrote:

Hey Shekar,

Can you attach your log files?
I'm wondering if it's a mis-configured log4j.xml (or a missing slf4j-log4j jar), which is leading to nearly empty log files. Also, I'm wondering if the job starts fully. Anything you can attach would be helpful.

Cheers,
Chris

On 9/2/14 11:43 AM, "Shekar Tippur" <[email protected]> wrote:

I am running in local mode.

S

On Sep 2, 2014 11:42 AM, "Yan Fang" <[email protected]> wrote:

Hi Shekar,

Are you running the job in local mode or YARN mode? If YARN mode, the log is in YARN's container log.

Thanks,

Fang, Yan
[email protected]
+1 (206) 849-4108

On Tue, Sep 2, 2014 at 11:36 AM, Shekar Tippur <[email protected]> wrote:

Chris,

Got some time to play around a bit more. I tried to edit samza-wikipedia/src/main/java/samza/examples/wikipedia/task/WikipediaFeedStreamTask.java to add a logger info statement to tap the incoming message. I don't see the messages being printed to the log file.

Is this the right place to start?
public class WikipediaFeedStreamTask implements StreamTask {

  private static final SystemStream OUTPUT_STREAM = new SystemStream("kafka", "wikipedia-raw");

  private static final Logger log = LoggerFactory.getLogger(WikipediaFeedStreamTask.class);

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
    Map<String, Object> outgoingMap = WikipediaFeedEvent.toMap((WikipediaFeedEvent) envelope.getMessage());
    log.info(envelope.getMessage().toString());
    collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, outgoingMap));
  }

}

On Mon, Aug 25, 2014 at 9:01 AM, Chris Riccomini <[email protected]> wrote:

Hey Shekar,

Your thought process is on the right track. It's probably best to start with hello-samza, and modify it to get what you want. To start with, you'll want to:

1. Write a simple StreamTask that just does something silly, like printing the messages that it receives.
2. Write a configuration for the job that consumes from just the stream (alerts from different sources).
3. Run this to make sure you've got it working.
4. Now add your table join. This can be either a change-data capture (CDC) stream, or via a remote DB call.
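For steps 1-3, the job boils down to a StreamTask class plus a small properties file. The following is only a sketch: the factory and system classes are standard Samza settings of that era, but the job name, task class, and topic name are made up for illustration.

```
# Sketch of a minimal Samza job config (job/task/topic names are illustrative).
# Local mode while testing; switch to org.apache.samza.job.yarn.YarnJobFactory for YARN.
job.factory.class=org.apache.samza.job.local.LocalJobFactory
job.name=alert-printer

# Step 1: the StreamTask implementation that just logs what it receives.
task.class=samza.examples.alerts.AlertPrinterStreamTask

# Step 2: the single input stream to consume.
task.inputs=kafka.alerts-raw

# Kafka system definition.
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=localhost:2181
systems.kafka.producer.metadata.broker.list=localhost:9092
```

Step 4 would then add either a second input (the CDC stream) or a remote DB client inside the task.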
That should get you to a point where you've got your job up and running. From there, you can create your own Maven project and migrate your code over accordingly.

Cheers,
Chris

On 8/24/14 1:42 AM, "Shekar Tippur" <[email protected]> wrote:

Chris,

I have gone through the documentation and decided that the option most suitable for me is stream-table.

I see the following things:

1. Point Samza to a table (database)
2. Point Samza to a stream - alert streams from different sources
3. Join on a key like a hostname

I have Hello Samza working. To extend that to what I need, I am not sure where to start (more code changes, configuration changes, or both)?

I have gone through http://samza.incubator.apache.org/learn/documentation/latest/api/overview.html

Is my thought process on the right track? Can you please point me in the right direction?

- Shekar

On Thu, Aug 21, 2014 at 1:08 PM, Shekar Tippur <[email protected]> wrote:

Chris,

This is a perfectly good answer.
I will start poking more into option #4.

- Shekar

On Thu, Aug 21, 2014 at 1:05 PM, Chris Riccomini <[email protected]> wrote:

Hey Shekar,

Your two options are really (3) or (4), then. You can either run some external DB that holds the data set, and query it from a StreamTask, or you can use Samza's state store feature to push data into a stream that you can then store in a partitioned key-value store along with your StreamTasks. There is some documentation here about the state store approach:

http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html

(4) is going to require more up-front effort from you, since you'll have to understand how Kafka's partitioning model works, and set up some pipeline to push the updates for your state. In the long run, I believe it's the better approach, though.
Local lookups on a key-value store should be faster than doing remote RPC calls to a DB for every message.

I'm sorry I can't give you a more definitive answer. It's really about trade-offs.

Cheers,
Chris

On 8/21/14 12:22 PM, "Shekar Tippur" <[email protected]> wrote:

Chris,

A big thanks for a swift response. The data set is huge and the frequency is in bursts. What do you suggest?

- Shekar

On Thu, Aug 21, 2014 at 11:05 AM, Chris Riccomini <[email protected]> wrote:

Hey Shekar,

This is feasible, and you are on the right thought process.

For the sake of discussion, I'm going to pretend that you have a Kafka topic called "PageViewEvent", which has just the IP address that was used to view a page. These messages will be logged every time a page view happens. I'm also going to pretend that you have some state called "IPGeo" (e.g. the MaxMind data set).
In this example, we'll want to join the long/lat geo information from IPGeo to the PageViewEvent, and send it to a new topic: PageViewEventsWithGeo.

You have several options on how to implement this example:

1. If your joining data set (IPGeo) is relatively small and changes infrequently, you can just pack it up in your jar or .tgz file, and open it in every StreamTask.
2. If your data set is small, but changes somewhat frequently, you can throw the data set on some HTTP/HDFS/S3 server somewhere, and have your StreamTask refresh it periodically by re-downloading it.
3. You can do remote RPC calls for the IPGeo data on every page view event by querying some remote service or DB (e.g. Cassandra).
4.
You can use Samza's state feature to send your IPGeo data as a series of messages to a log-compacted Kafka topic (https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction), and configure your Samza job to read this topic as a bootstrap stream (http://samza.incubator.apache.org/learn/documentation/0.7.0/container/streams.html).

For (4), you'd have to partition the IPGeo state topic by the same key as PageViewEvent. If PageViewEvent were partitioned by, say, member ID, but you want your IPGeo state topic to be partitioned by IP address, then you'd have to have an upstream job that re-partitioned PageViewEvent into some new topic by IP address. This new topic will have to have the same number of partitions as the IPGeo state topic (if IPGeo has 8 partitions, then the new PageViewEventRepartitioned topic needs 8 as well).
This will cause your PageViewEventRepartitioned topic and your IPGeo state topic to be aligned, such that the StreamTask that gets page views for IP address X will also have the IPGeo information for IP address X.

Which strategy you pick is really up to you. :) (4) is the most complicated, but also the most flexible and most operationally sound. (1) is the easiest if it fits your needs.

Cheers,
Chris

On 8/21/14 10:15 AM, "Shekar Tippur" <[email protected]> wrote:

Hello,

I am new to Samza. I have just installed Hello Samza and got it working.

Here is the use case for which I am trying to use Samza:

1. Cache the contextual information, which contains more information about the hostname or IP address, using Samza/YARN/Kafka
2. Collect alert and metric events, which contain either a hostname or an IP address
3. Append contextual information to the alert or metric and insert it into a Kafka queue from which other subscribers read off of.
Can you please shed some light on:

1. Is this feasible?
2. Am I on the right thought process?
3. How do I start?

I now have 1 & 2 of them working disparately. I need to integrate them.

Appreciate any input.

- Shekar
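Tying the thread together: the enrichment flow Shekar describes (look up contextual information by hostname, append it to each alert, emit the result) can be sketched in plain Java. This is not Samza API code; the class and method names are invented for the example, and a HashMap stands in for the partitioned key-value store that option (4) would populate from a log-compacted bootstrap topic.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for the stream-table join discussed above.
// In a real Samza job, contextStore would be a key-value store fed by a
// context (CDC) stream, and process() would live inside a StreamTask.
class AlertEnricher {
    private final Map<String, String> contextStore = new HashMap<>();

    // Simulates consuming the context/table side of the join.
    void putContext(String hostname, String context) {
        contextStore.put(hostname, context);
    }

    // Simulates the per-message join: enrich an alert with the
    // contextual information stored under its hostname key.
    String process(String hostname, String alert) {
        String context = contextStore.getOrDefault(hostname, "unknown");
        return alert + " [host=" + hostname + ", context=" + context + "]";
    }

    public static void main(String[] args) {
        AlertEnricher enricher = new AlertEnricher();
        enricher.putContext("web01", "rack=r1,dc=east");
        // Prints: disk full [host=web01, context=rack=r1,dc=east]
        System.out.println(enricher.process("web01", "disk full"));
    }
}
```

In a real task the enriched event would go out via collector.send(...) to an output topic that downstream subscribers consume; alignment between the alert stream's partitioning key and the context store's key is what makes the local lookup work.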
