Re: S3 recovery and checkpoint directories exhibit explosive growth

2017-07-23 Thread prashantnayak
Hi Xiaogang and Stephan

We're continuing to test and have now set up the cluster to disable
incremental RocksDB checkpointing and to increase the checkpoint
interval from 30s to 120s (not ideal, really :-( ).

We'll run it with a large number of jobs and report back if this setup shows
improvement.

We'd appreciate any other insights you might have on this problem.

Thanks
Prashant





Count Different Codes in a Window

2017-07-23 Thread Raj Kumar
Hi,

We have a requirement to aggregate the data every 10 minutes and write the
aggregated results to Elasticsearch once per window. Right now we iterate over
the window's iterable to count the different status codes. Is there a better
way to count the different status codes?

// Note: the generic type parameters were stripped by the mailing-list archive;
// Tuple4<String, String, String, String> for the input and long[] for the output
// are assumptions.
public void apply(TimeWindow timeWindow,
                  Iterable<Tuple4<String, String, String, String>> iterable,
                  Collector<long[]> collector) throws Exception {

    long[] counts = new long[10];
    Arrays.fill(counts, 0L);

    // count the different types of records in the window
    for (Tuple4<String, String, String, String> in : iterable) {
        counts[0]++;

        // status code classes (f2 is the status code)
        if (in.f2 != null && in.f2.startsWith("5"))
            counts[1]++;
        else if (in.f2 != null && in.f2.startsWith("4"))
            counts[2]++;
        else if (in.f2 != null && in.f2.startsWith("2"))
            counts[3]++;

        // HTTP methods (f3 is the method)
        if (in.f3 != null && in.f3.equalsIgnoreCase("GET"))
            counts[4]++;
        else if (in.f3 != null && in.f3.equalsIgnoreCase("POST"))
            counts[5]++;
        else if (in.f3 != null && in.f3.equalsIgnoreCase("PUT"))
            counts[6]++;
        else if (in.f3 != null && in.f3.equalsIgnoreCase("HEAD"))
            counts[7]++;
    }
    ...
}
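
For comparison, here is a sketch of doing the same counting incrementally with a
FoldFunction, so the window only keeps the long[] accumulator instead of
buffering every element until apply() fires. The Tuple4 type parameters, the
class name and the index layout are assumptions taken from the snippet above.

import org.apache.flink.api.common.functions.FoldFunction;
import org.apache.flink.api.java.tuple.Tuple4;

// counts[0] = total, [1] = 5xx, [2] = 4xx, [3] = 2xx, [4..7] = GET/POST/PUT/HEAD
public class CodeCounter
        implements FoldFunction<Tuple4<String, String, String, String>, long[]> {

    @Override
    public long[] fold(long[] counts, Tuple4<String, String, String, String> in) {
        counts[0]++;
        if (in.f2 != null) {
            if (in.f2.startsWith("5"))      counts[1]++;
            else if (in.f2.startsWith("4")) counts[2]++;
            else if (in.f2.startsWith("2")) counts[3]++;
        }
        if (in.f3 != null) {
            if (in.f3.equalsIgnoreCase("GET"))       counts[4]++;
            else if (in.f3.equalsIgnoreCase("POST")) counts[5]++;
            else if (in.f3.equalsIgnoreCase("PUT"))  counts[6]++;
            else if (in.f3.equalsIgnoreCase("HEAD")) counts[7]++;
        }
        return counts;
    }
}

// usage sketch (the stream, key and window below are placeholders):
// stream.keyBy(0).timeWindow(Time.minutes(10)).fold(new long[10], new CodeCounter());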







Re: problems starting the training exercise TaxiRideCleansing on local cluster

2017-07-23 Thread Günter Hipler

Hi Nico,

thanks for looking into it. The reason for the behavior on my system: I
had two different JDK versions installed (OpenJDK and Oracle JDK), which I
wasn't aware of because I generally prefer the Oracle JDK. I didn't analyze
it in greater depth, but the two versions were apparently used in different
scenarios, which seems to have caused the error. After removing OpenJDK
completely from my system I could use the local Flink cluster with my
application jar.


Sorry for the inconvenience!

Günter


On 10.07.2017 15:43, Nico Kruber wrote:

Hi Günter,
unfortunately, I cannot reproduce your error. This is what I did (following
http://training.data-artisans.com/devEnvSetup.html):

* clone and build the flink-training-exercises project:
git clone https://github.com/dataArtisans/flink-training-exercises.git
cd flink-training-exercises
mvn clean install

* downloaded & extracted flink 1.3.1 (hadoop 2.7, Scala 2.10 - but that should
not matter)
* /bin/start-local.sh

* create a development project:
mvn archetype:generate \
 -DarchetypeGroupId=org.apache.flink\
 -DarchetypeArtifactId=flink-quickstart-java\
 -DarchetypeVersion=1.3.1   \
 -DgroupId=org.apache.flink.quickstart  \
 -DartifactId=flink-java-project\
 -Dversion=0.1  \
 -Dpackage=org.apache.flink.quickstart  \
 -DinteractiveMode=false
* add the flink-training-exercises 0.10.0 dependency:

<dependency>
  <groupId>com.data-artisans</groupId>
  <artifactId>flink-training-exercises</artifactId>
  <version>0.10.0</version>
</dependency>
* implement the task (http://training.data-artisans.com/exercises/rideCleansing.html)
* /flink run -c org.apache.flink.quickstart.TaxiStreamCleansingJob
./flink-java-project/target/flink-java-project-0.1.jar

What I noticed, though, is that my dependency tree only contains
joda-time-2.7.jar, not 2.9.9 as in your case - did you change the dependencies
somehow?

mvn clean package
...
[INFO] Including joda-time:joda-time:jar:2.7 in the shaded jar.
...
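
As a side note (a generic Maven check, not one of the steps above), you can see
which joda-time versions your project resolves, and through which dependencies
they come in, with

mvn dependency:tree -Dincludes=joda-time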

Could you try setting up a new development project the way described above and
copying your code into it?

If that doesn't help, try with a freshly extracted, unaltered Flink archive.


Nico


On Sunday, 9 July 2017 17:32:25 CEST Günter Hipler wrote:

Thanks for the response.

My classpath contains the following joda-time version:

mvn dependency:build-classpath
[INFO] Scanning for projects...
[INFO]
[INFO] Building Flink Quickstart Job 0.1
[INFO]
[INFO] --- maven-dependency-plugin:2.8:build-classpath (default-cli) @
flink-java-project ---
[INFO] Dependencies classpath:

togram/2.1.6/HdrHistogram-2.1.6.jar:/home/swissbib/.m2/repository/com/twitter/jsr166e/1.1.0/jsr166e-1.1.0.jar:/home/swissbib/.m2/repository/joda-time/joda-time/2.9.9/joda-time-2.9.9.jar:



which definitely contains the required method
(http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormatter.html#withZoneUTC--).

Something else must be going wrong. I suspect the way I started (or
configured) the local cluster, but it was done as described in the training
setup (http://training.data-artisans.com/devEnvSetup.html), which is very
straightforward.

Günter

On 09.07.2017 16:17, Ted Yu wrote:

Since the exception was about a missing method (withZoneUTC) instead
of class not found, it was likely due to a conflicting joda time jar
being on the classpath.

Cheers

On Sun, Jul 9, 2017 at 1:22 AM, Günter Hipler wrote:
Hi,

sorry for this newbie question...

I'm following the data Artisans exercises and wanted to run the TaxiRide
Cleansing job on my local cluster (version 1.3.1)
(http://training.data-artisans.com/exercises/rideCleansing.html).

While this is possible within my IDE, the cluster throws an exception
because of a missing type, although that type is part of the application
jar the cluster is provided with.

swissbib@ub-sbhp02:~/environment/code/flink_einarbeitung/training/flink-java-project/target$ jar tf flink-java-project-0.1.jar | grep DateTimeFormatter
org/elasticsearch/common/joda/FormatDateTimeFormatter.class
org/joda/time/format/DateTimeFormatter.class
org/joda/time/format/DateTimeFormatterBuilder$CharacterLiteral.class
org/joda/time/format/DateTimeFormatterBuilder$Composite.class
org/joda/time/format/DateTimeFormatterBuilder$FixedNumber.class
org/joda/time/format/DateTimeFormatterBuilder$Fraction.class
org/joda/time/format/DateTimeFormatterBuilder$MatchingParser.class
org/joda/time/format/DateTimeFormatterBuilder$NumberFormatter.class
org/joda/time/format/DateTimeFormatterBuilder$PaddedNumber.class


Find the running median from a data stream

2017-07-23 Thread Gabriele Di Bernardo
Hi guys,

I want to keep track of the running median of a keyed data stream. I was
considering applying a RichMapFunction to the stream and storing two heaps
(PriorityQueues) in a ValueState object in order to find the running median.
However, I am not really sure whether this is the best approach
performance-wise. Do you have any suggestions, or have you ever done something similar?
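
For concreteness, here is a minimal sketch of what I have in mind (the event type
Tuple2<String, Double>, the state names and the class name are just placeholders;
the PriorityQueues would be serialized generically, i.e. via Kryo):

import java.util.Comparator;
import java.util.PriorityQueue;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Running median per key: the lower half of the values lives in a max-heap,
// the upper half in a min-heap, both kept in keyed ValueState.
public class RunningMedian
        extends RichFlatMapFunction<Tuple2<String, Double>, Tuple2<String, Double>> {

    private transient ValueState<PriorityQueue<Double>> lowerState; // max-heap
    private transient ValueState<PriorityQueue<Double>> upperState; // min-heap

    @Override
    public void open(Configuration parameters) {
        lowerState = getRuntimeContext().getState(new ValueStateDescriptor<>(
                "lowerHalf", TypeInformation.of(new TypeHint<PriorityQueue<Double>>() {})));
        upperState = getRuntimeContext().getState(new ValueStateDescriptor<>(
                "upperHalf", TypeInformation.of(new TypeHint<PriorityQueue<Double>>() {})));
    }

    @Override
    public void flatMap(Tuple2<String, Double> in, Collector<Tuple2<String, Double>> out)
            throws Exception {
        PriorityQueue<Double> lower = lowerState.value();
        PriorityQueue<Double> upper = upperState.value();
        if (lower == null) {
            Comparator<Double> descending = Comparator.reverseOrder();
            lower = new PriorityQueue<>(11, descending); // max-heap
            upper = new PriorityQueue<>();               // min-heap
        }

        // insert into the correct half, then rebalance so the sizes differ by at most one
        if (lower.isEmpty() || in.f1 <= lower.peek()) {
            lower.add(in.f1);
        } else {
            upper.add(in.f1);
        }
        if (lower.size() > upper.size() + 1) {
            upper.add(lower.poll());
        } else if (upper.size() > lower.size() + 1) {
            lower.add(upper.poll());
        }

        // median is the top of the larger heap, or the mean of both tops if equal in size
        double median = lower.size() == upper.size()
                ? (lower.peek() + upper.peek()) / 2.0
                : (lower.size() > upper.size() ? lower.peek() : upper.peek());

        lowerState.update(lower);
        upperState.update(upper);
        out.collect(Tuple2.of(in.f0, median));
    }
}

It would be applied on the keyed stream, e.g. input.keyBy(0).flatMap(new RunningMedian()).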

Thank you so much!

Best,


Gabriele

Re: notNext() and next(negation) not yielding same output in Flink CEP

2017-07-23 Thread Dawid Wysakowicz
Hi Yassine,

First of all, notNext(A) is not equal to next(not A). notNext should be
considered a “stop condition”: if an event matching the A condition occurs, the
current partial match is discarded. The next(not A), on the other hand, accepts
every event that does not match the A condition.

So let’s analyze a sequence of events like “b c a1 a2 a3 d”. For the first
version, with next(not A), the output will be “c a1 a2 a3 d”, which is what you
expect, I think. In the other version, with notNext(A), the partial match “c a1”
will be discarded as soon as “a2” arrives, because notNext says that after the
matched A’s there must be no further A - which is why only single-element
“middle” sequences remain.
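
To make the difference concrete, here is a rough sketch of the two variants in
the 1.3 Pattern API; the Event type and its isA() check are placeholders for
your actual event class and condition:

import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;

// placeholder event type
public class Event {
    public String type;
    public boolean isA() { return "A".equals(type); }
}

// variant 1: explicitly require a non-A event after the consecutive A's
Pattern<Event, ?> withNext = Pattern.<Event>begin("start")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) { return !e.isA(); }
        })
        .next("middle").where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) { return e.isA(); }
        }).oneOrMore().consecutive()
        .next("not").where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) { return !e.isA(); }
        });

// variant 2: forbid any further A right after "middle" - this discards the
// partial match as soon as another A arrives, hence the single-element middles
Pattern<Event, ?> withNotNext = Pattern.<Event>begin("start")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) { return !e.isA(); }
        })
        .next("middle").where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) { return e.isA(); }
        }).oneOrMore().consecutive()
        .notNext("not").where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) { return e.isA(); }
        });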

I hope this helps understanding how notNext works.

Regards,
Dawid

> On 22 Jul 2017, at 20:32, Yassine MARZOUGUI wrote:
> 
> Hi all,
> 
> I would like to match the maximal consecutive sequences of events of type A 
> in a stream.
> I'm using the following:
> Pattern.begin("start").where(event is not A)
> .next("middle").where(event is A).oneOrMore().consecutive()
> .next("not").where(event is not A)
> This gives the output I want. However, if I use notNext("not").where(event is 
> A) instead of next("not").where(event is not A), the middle patterns contain 
> only sequences of single elements of type A.
> My understanding is that notNext() in this case is equivalent to 
> next(negation), so why is the output different?
> 
> Thank you in advance.
> 
> Best,
> Yassine





Re: Gelly PageRank implementations in 1.2 to 1.3

2017-07-23 Thread Kaepke, Marc
Hi Greg,

I'm doing an evaluation comparing Gelly and GraphX (Spark). Both frameworks
implement PageRank, and Gelly provides a lot of variants (*thumbs up*).
In a really small initial test I get the same ranking result for the
vertex-centric, scatter-gather and GSA versions. Only the new implementation in
1.3.x (the one without any graph-processing model) computed a different ranking.


// Note: the generic type parameters below were stripped by the mailing-list archive;
// DataSet<Vertex<Long, Double>> is assumed for the three model-based variants and
// DataSet<PageRank.Result<Long>> for the new library implementation.

/* vertex centric */
DataSet<Vertex<Long, Double>> pagerankVC = small.run(new PageRank<>(0.5, 10));
System.err.println("VC");
pagerankVC.printToErr();

/* scatter gather */
DataSet<Vertex<Long, Double>> pageRankSG =
        small.run(new org.apache.flink.graph.library.PageRank<>(0.5, 10));
System.err.println("SG");
pageRankSG.printToErr();

/* gsa */
DataSet<Vertex<Long, Double>> pageRankGSA = small.run(new GSAPageRank<>(0.5, 10));
System.err.println("GSA");
pageRankGSA.printToErr();

/* without graph model */
DataSet<PageRank.Result<Long>> pageRankDI = small.run(new PageRank<>(0.5, 10));
System.err.println("delta iteration");
pageRankDI.printToErr();

My input graph is:
vertices

  *   id 1, val 0
  *   id 2, val 0
  *   id 3, val 0
  *   id 4, val 0

edges

  *   src 1, trg 2, val 3
  *   src 1, trg 1, val 2
  *   src 2, trg 1, val 3
  *   src 2, trg 4, val 6

Ranking output

  *   vertex-centric
 *   id 4 with 1.16
 *   id 1 with 1.103
 *   id 2 with 0.815
 *   id 3 with 0
  *   sg and gsa
 *   id 4 with 2.208
 *   id 1 with 2.114
 *   id 2 with 1.546
 *   id 3 with 0
  *   new PageRank in Gelly 1.3.X
 *   id 1 with 0.392
 *   id 2 with 0.313
 *   id 4 with 0.294

Do you know why?


Best
Marc


On 23.07.2017 at 02:22, Greg Hogan wrote:

Hi Marc,

PageRank and GSAPageRank were moved to the flink-gelly-examples jar, into the
org.apache.flink.graph.examples package. A library algorithm was added that
supports both source and sink vertices. This limitation of the old algorithms
was noted in the class documentation, and I understand it to be an effect of the
delta iterations. The new implementation is also significantly faster
(https://github.com/apache/flink/pull/2733#issuecomment-278789830).

PageRank can be run using the examples jar from the command line, for example 
(don’t wildcard the jar file as in the documentation until we get the javadoc 
jar removed from the next release).

$ mv opt/flink-gelly* lib/
$ ./bin/flink run examples/gelly/flink-gelly-examples_2.11-1.3.1.jar \
--algorithm PageRank \
--input CSV --type integer --simplify directed --input_filename  
--input_field_delimiter $'\t' \
--output print

The output can also be written to CSV in similar fashion to the input.

The code to call the library PageRank from the examples driver is the same as
with any GraphAlgorithm
(https://github.com/apache/flink/blob/release-1.3/flink-libraries/flink-gelly-examples/src/main/java/org/apache/flink/graph/drivers/PageRank.java):

graph.run(new PageRank<>(dampingFactor, iterations, convergenceThreshold));

Please let us know of any issues or additional questions!

Greg


On Jul 22, 2017, at 4:33 PM, Kaepke, Marc wrote:

Hi there,

Why was the PageRank version (which implements the GraphAlgorithm interface)
removed in 1.3?

How can I use the new PageRank implementation in 1.3.x?

Why doesn’t PageRank use the graph processing models (vertex-centric, SG or
GSA) anymore?

Thanks!

Bests,
marc