Running on AWS/EMR/Yarn - where is the WebUI?

2016-08-15 Thread Jon Yeargers
Working with a 3 node cluster. Started via YARN.

If I go to port 8080 I see the Tomcat start screen. 8088 has the Yarn
screen.

Didn't see anything obvious to start the UI in the bin folder.


Performance issues - is my topology not setup properly?

2016-08-15 Thread Jon Yeargers
Flink 1.1.1 is running on AWS / EMR. 3 boxes - total 24 cores and 90Gb of
RAM.

Job is submitted via yarn.

Topology:

read csv files from SQS -> parse files by line  and create object for each
line -> pass through 'KeySelector' to pair entries (by hash) over 60 second
window -> write original and matched sets to BigQuery.

Each file contains ~ 15K lines and there are ~10 files / second.

My topology can't keep up with this stream. What am I doing wrong?

Articles like this (
http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/)
speak of > 1 million events / sec / core. Im not clear what constitutes an
'event' but given the number of cores Im throwing at this problem I would
expect higher throughput.

I run the job as :

HADOOP_CONF_DIR=/etc/hadoop/conf bin/flink run -m yarn-cluster -yn 3 -ys 8
-yst -ytm 4096 ../flink_all.jar


Output of KeyBy->TimeWindow->Reduce?

2016-08-07 Thread Jon Yeargers
Take the above topology and send the resultant DataStream to .print().

Assume the Reduce function changes a character to it's uppercase equivalent.

Assume this whole stream makes it within the span of a single Window.

Send it a stream like a,b,c,d,e,f,g,h,c,e,g.


What will be output?

a) a,b,C,d,E,f,G,h
b) a,b,c,d,e,f,g,h
c) a,b,c,c,d,e,e,f,g,g,h
d) something else


What is output from DataSet.print()?

2016-08-02 Thread Jon Yeargers
Topology snip:

datastream = 
some_stream.keyBy(keySelector).timeWindow(Time.seconds(60)).reduce(new
some_KeyReduce());


If I have a KeySelector that's pretty 'loose' (IE lots of matches) the
'some_KeyReduce' function gets hit frequently and some set of values is
printed out via 'datastream.print()'.

If I have a more stringent KeySelector the 'keyReduce' function never gets
called but the 'datastream.print()' function still outputs numerous values.

So how are the KeySelector and the output of the datastream.print()
related? Or are they?