Monitoring REST API

2016-12-21 Thread Lydia Ickler
Hi all,

I have a question regarding the Monitoring REST API;

I want to analyze the behavior of my program with regard to I/O MiB/s, network MiB/s and CPU %, as the authors of this paper did: https://hal.inria.fr/hal-01347638v2/document
From the JSON returned at http://master:8081/jobs/<jobid>/ I get a summary that includes read/write records and read/write bytes.
Unfortunately the entries of Network or CPU are either (unknown) or 0.0. I am 
running my program on a cluster with up to 32 nodes.

Where can I find the values for e.g. CPU or Network?

Thanks in advance!
Lydia



Monitoring Flink on Yarn

2016-12-19 Thread Lydia Ickler
Hi all, 

I am using Flink 1.1.3 on YARN and I wanted to ask how I can save the monitoring logs, e.g. for I/O or network, to HDFS or the local FS?
Since YARN closes the Flink session after the job finishes, I can't access the logs via the REST API.

I am looking forward to your answer!
Best regards,
Lydia 

multiple k-means in parallel

2016-11-27 Thread Lydia Ickler
Hi, 

I want to run k-means with different k in parallel. 
So each worker should calculate its own k-means. Is that possible? 

If I map over a list of integers and then apply k-means inside that map, I get the following error:
Task not serializable

I am looking forward to your answers!
Lydia

Write matrix/vector

2016-05-29 Thread Lydia Ickler
Hi,

I would like to know how to write a Matrix or Vector (Dense/Sparse) to file?

Thanks in advance!
Best regards,
Lydia



sparse matrix

2016-05-29 Thread Lydia Ickler
Hi all,

I have two questions regarding sparse matrices:

1. I have a sparse Matrix: val sparseMatrix = SparseMatrix.fromCOO(row, col, 
csvInput.collect())
and now I would like to extract all values that are in a specific row X. How 
would I tackle that? flatMap() and filter() do not seem to be supported in that 
case.

2. I want to drop/delete one specific row and column from the matrix and thereby also reduce its dimension.
What is the smartest way to do that?

Thanks in advance!
Lydia

Re: Scatter-Gather Iteration aggregators

2016-05-13 Thread Lydia Ickler
Hi Vasia, 

okay I understand now :)
So it works fine if I want to collect the sum of values.
But what if I need to reset the DoubleSumAggregator back to 0 in order to then 
set it to a new value to save the absolute maximum?
Please have a look at the code above. 

Any idea why it is not working?
 

public static class VertexDistanceUpdater extends VertexUpdateFunction<Long, Double, Double> {

    DoubleSumAggregator aggregator = new DoubleSumAggregator();

    public void preSuperstep() {
        // retrieve the Aggregator
        aggregator = getIterationAggregator("sumAggregator");
    }

    public void updateVertex(Vertex<Long, Double> vertex,
            MessageIterator<Double> inMessages) {
        double sum = 0;
        for (double msg : inMessages) {
            sum = sum + (msg);
        }

        if (Math.abs(sum) > Math.abs(aggregator.getAggregate().getValue())) {

            aggregator.reset();
            aggregator.aggregate(sum);

        }
        setNewVertexValue(sum);
    }
}
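
For reference, a sketch of the pattern Vasia describes in the reply quoted below: the value handed to an iteration aggregator only becomes visible in the *next* superstep, via getPreviousIterationAggregate(). The Long vertex IDs, Double values and the class name are assumptions, not taken from the thread; the aggregator is assumed to be registered as scatterGatherConfiguration.registerAggregator("sumAggregator", new DoubleSumAggregator()).

// needs org.apache.flink.api.common.aggregators.DoubleSumAggregator and org.apache.flink.types.DoubleValue
public static class MaxAwareUpdater extends VertexUpdateFunction<Long, Double, Double> {

    DoubleSumAggregator aggregator;
    double previousSum = 0.0;

    @Override
    public void preSuperstep() {
        // this aggregator collects values that become visible in the next superstep
        aggregator = getIterationAggregator("sumAggregator");
        // the value collected during the previous superstep
        DoubleValue prev = getPreviousIterationAggregate("sumAggregator");
        if (prev != null) {
            previousSum = prev.getValue();
        }
    }

    @Override
    public void updateVertex(Vertex<Long, Double> vertex, MessageIterator<Double> inMessages) throws Exception {
        double sum = 0;
        for (double msg : inMessages) {
            sum += msg;
        }
        aggregator.aggregate(sum);   // readable next superstep, also from the MessagingFunction
        setNewVertexValue(previousSum == 0.0 ? sum : sum / previousSum);
    }
}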



> Am 13.05.2016 um 09:25 schrieb Vasiliki Kalavri :
> 
> Hi Lydia, 
> 
> an iteration aggregator combines all aggregates globally once per superstep 
> and makes them available in the *next* superstep.
> Within each scatter-gather iteration, one MessagingFunction (scatter phase) 
> and one VertexUpdateFunction (gather phase) are executed. Thus, if you set an 
> aggregate value within one of those, the value will be available in the next 
> superstep. You can retrieve it calling the getPreviousIterationAggregate() 
> method.
> Let me know if that clears things up!
> 
> -Vasia.
> 
> On 13 May 2016 at 08:57, Lydia Ickler  <mailto:ickle...@googlemail.com>> wrote:
> Hi Vasia, 
> 
> yes, but only independently within each Function or not?
> 
> If I set the aggregator in VertexUpdateFunction then the newly set value is 
> not visible in the MessageFunction.
> Or am I doing something wrong? I would like to have a shared aggregator to 
> normalize vertices.
> 
> 
>> Am 13.05.2016 um 08:04 schrieb Vasiliki Kalavri > <mailto:vasilikikala...@gmail.com>>:
>> 
>> Hi Lydia,
>> 
>> registered aggregators through the ScatterGatherConfiguration are accessible 
>> both in the VertexUpdateFunction and in the MessageFunction.
>> 
>> Cheers,
>> -Vasia.
>> 
>> On 12 May 2016 at 20:08, Lydia Ickler > <mailto:ickle...@googlemail.com>> wrote:
>> Hi,
>> 
>> I have a question regarding the Aggregators of a Scatter-Gather Iteration.
>> Is it possible to have a global aggregator that is accessible in 
>> VertexUpdateFunction() and MessagingFunction() at the same time?
>> 
>> Thanks in advance,
>> Lydia
>> 
> 
> 



Re: Scatter-Gather Iteration aggregators

2016-05-12 Thread Lydia Ickler
Hi Vasia, 

yes, but only independently within each Function or not?

If I set the aggregator in VertexUpdateFunction then the newly set value is not 
visible in the MessageFunction.
Or am I doing something wrong? I would like to have a shared aggregator to 
normalize vertices.


> Am 13.05.2016 um 08:04 schrieb Vasiliki Kalavri :
> 
> Hi Lydia,
> 
> registered aggregators through the ScatterGatherConfiguration are accessible 
> both in the VertexUpdateFunction and in the MessageFunction.
> 
> Cheers,
> -Vasia.
> 
> On 12 May 2016 at 20:08, Lydia Ickler  <mailto:ickle...@googlemail.com>> wrote:
> Hi,
> 
> I have a question regarding the Aggregators of a Scatter-Gather Iteration.
> Is it possible to have a global aggregator that is accessible in 
> VertexUpdateFunction() and MessagingFunction() at the same time?
> 
> Thanks in advance,
> Lydia
> 



Scatter-Gather Iteration aggregators

2016-05-12 Thread Lydia Ickler
Hi,

I have a question regarding the Aggregators of a Scatter-Gather Iteration.
Is it possible to have a global aggregator that is accessible in 
VertexUpdateFunction() and MessagingFunction() at the same time?

Thanks in advance,
Lydia

normalize vertex values

2016-05-12 Thread Lydia Ickler
Hi all,

If I have a Graph g and I would like to normalize all vertex values by the absolute maximum of all vertex values, which API function should I choose?

Thanks in advance!
Lydia

Re: Find differences

2016-04-07 Thread Lydia Ickler
Never mind! I figured it out with groupBy and
reduceGroup.

Von meinem iPhone gesendet
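
For reference, one way to express the "subset of A with no partner in B on fields (0, 1)" described above, using coGroup (a variation of the groupBy/reduceGroup idea); the Tuple3<Integer, Integer, Double> element type is an assumption:

public static class AMinusB implements CoGroupFunction<Tuple3<Integer, Integer, Double>,
        Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>> {

    @Override
    public void coGroup(Iterable<Tuple3<Integer, Integer, Double>> fromA,
            Iterable<Tuple3<Integer, Integer, Double>> fromB,
            Collector<Tuple3<Integer, Integer, Double>> out) {
        if (!fromB.iterator().hasNext()) {                 // no element of B shares the key (f0, f1)
            for (Tuple3<Integer, Integer, Double> t : fromA) {
                out.collect(t);
            }
        }
    }
}

DataSet<Tuple3<Integer, Integer, Double>> onlyInA =
        a.coGroup(b).where(0, 1).equalTo(0, 1).with(new AMinusB());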

> Am 07.04.2016 um 11:51 schrieb Lydia Ickler :
> 
> Hi,
> 
> If i have 2 DataSets A and B of Type Tuple3 how would 
> I get a subset of A (based on the fields (0,1)) that does not occur in B?
> Is there maybe an already implemented method?
> 
> Best regards,
> Lydia
> 
> Von meinem iPhone gesendet


Find differences

2016-04-07 Thread Lydia Ickler
Hi,

If I have two DataSets A and B of type Tuple3, how would I get the subset of A (based on fields (0, 1)) that does not occur in B?
Is there maybe an already implemented method?
Is there maybe an already implemented method?

Best regards,
Lydia

Von meinem iPhone gesendet

varying results: local VS cluster

2016-04-04 Thread Lydia Ickler
Hi all,

I have an issue regarding execution on 1 machine VS 5 machines.
If I execute the following code the results are not the same though I would 
expect them to be since the input file is the same.
Do you have any suggestions?

Thanks in advance!
Lydia
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setGlobalJobParameters(parameters);

//read input file
DataSet<Tuple3<Integer, Integer, Double>> matrixA = readMatrix(env, parameters.get("input"));
//Approximate EigenVector by PowerIteration
//get initial vector - which equals matrixA * [1, ... , 1]
DataSet<Tuple3<Integer, Integer, Double>> initial0 = matrixA.groupBy(0).sum(2);
DataSet<Tuple3<Integer, Integer, Double>> maximum = initial0.maxBy(2);
//normalize by maximum value
DataSet<Tuple3<Integer, Integer, Double>> initial = initial0.cross(maximum).map(new normalizeByMax());

//BulkIteration to find dominant eigenvector
IterativeDataSet<Tuple3<Integer, Integer, Double>> iteration = initial.iterate(1);

DataSet<Tuple3<Integer, Integer, Double>> intermediate = matrixA.join(iteration).where(1).equalTo(0)
        .map(new ProjectJoinResultMapper()).groupBy(0, 1).sum(2).groupBy(0).sum(2)
        .cross(matrixA.join(iteration).where(1).equalTo(0)
                .map(new ProjectJoinResultMapper()).groupBy(0, 1).sum(2).groupBy(0).sum(2).maxBy(2))
        .map(new normalizeByMax());

DataSet<Tuple3<Integer, Integer, Double>> diffs =
        iteration.join(intermediate).where(0).equalTo(0).with(new deltaFilter());
DataSet<Tuple3<Integer, Integer, Double>> eigenVector =
        iteration.closeWith(intermediate, diffs);

eigenVector.writeAsCsv(parameters.get("output"));
env.execute("Power Iteration");



Re: wait until BulkIteration finishes

2016-03-31 Thread Lydia Ickler
Hi,

thanks for your reply!
Could you please give me an example for the close() step? I can’t find an 
example online only for open().
There I can „save“ my new result?

Best regards,
Lydia


> Am 31.03.2016 um 18:16 schrieb Stephan Ewen :
> 
> Hi Lydia!
> 
> The same function instances (for example MapFunction objects) are used across 
> all supersteps. If, for example, you store something in a HashMap inside some 
> MapFunction, you can access that in the next iteration superstep.
> 
> You can figure out when a superstep finished and when the next superstep 
> starts, by overriding "open()" and "close()" from the RichFunction interface.
> 
> Stephan
> 
> 
> On Thu, Mar 31, 2016 at 4:45 PM, Till Rohrmann  <mailto:trohrm...@apache.org>> wrote:
> I think I don't completely understand your question.
> 
> On Thu, Mar 31, 2016 at 4:40 PM, Lydia Ickler  <mailto:ickle...@googlemail.com>> wrote:
> Hi Till, 
> 
> thanks for your reply!
> Is there a way to store intermediate results of the bulk iteration to use 
> then in the next iteration except the data set one sends already by default?
> 
> Best regards, 
> Lydia
> 
> 
>> Am 31.03.2016 um 12:01 schrieb Till Rohrmann > <mailto:trohrm...@apache.org>>:
>> 
>> Hi Lydia,
>> 
>> all downstream operators which depend on the bulk iteration will wait 
>> implicitly until data from the iteration operator is available.
>> 
>> Cheers,
>> Till
>> 
>> On Thu, Mar 31, 2016 at 9:39 AM, Lydia Ickler > <mailto:ickle...@googlemail.com>> wrote:
>> Hi all,
>> 
>> is there a way to tell the program that it should wait until the 
>> BulkIteration finishes before the rest of the program is executed?
>> 
>> Best regards,
>> Lydia
>> 
> 
> 
> 
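
A minimal sketch of the open()/close() hooks Stephan describes above; the element type, the HashMap and the printed messages are illustrations, not taken from the thread. The same function instance is reused in every superstep, so anything stored in a field survives into the next one.

// needs java.util.HashMap/Map and org.apache.flink.configuration.Configuration
public static class SuperstepAwareMapper
        extends RichMapFunction<Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>> {

    // reused across supersteps, so values "saved" here are still present in the next superstep
    private final Map<Integer, Double> carried = new HashMap<>();

    @Override
    public void open(Configuration parameters) {
        // called when a superstep starts
        System.out.println("starting superstep " + getIterationRuntimeContext().getSuperstepNumber());
    }

    @Override
    public Tuple3<Integer, Integer, Double> map(Tuple3<Integer, Integer, Double> value) {
        carried.put(value.f0, value.f2);
        return value;
    }

    @Override
    public void close() {
        // called when the current superstep has finished; carried still holds the results
        System.out.println("finished superstep with " + carried.size() + " cached values");
    }
}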



Re: wait until BulkIteration finishes

2016-03-31 Thread Lydia Ickler
Hi Till, 

thanks for your reply!
Is there a way to store intermediate results of the bulk iteration to use then 
in the next iteration except the data set one sends already by default?

Best regards, 
Lydia


> Am 31.03.2016 um 12:01 schrieb Till Rohrmann :
> 
> Hi Lydia,
> 
> all downstream operators which depend on the bulk iteration will wait 
> implicitly until data from the iteration operator is available.
> 
> Cheers,
> Till
> 
> On Thu, Mar 31, 2016 at 9:39 AM, Lydia Ickler  <mailto:ickle...@googlemail.com>> wrote:
> Hi all,
> 
> is there a way to tell the program that it should wait until the 
> BulkIteration finishes before the rest of the program is executed?
> 
> Best regards,
> Lydia
> 
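
For reference, a minimal sketch of the implicit waiting Till describes in the quote above; step(), initial, maxIterations and outputPath are placeholders, not from the thread:

// everything chained after closeWith() only runs once the BulkIteration has terminated
IterativeDataSet<Tuple3<Integer, Integer, Double>> iteration = initial.iterate(maxIterations);
DataSet<Tuple3<Integer, Integer, Double>> nextStep = step(iteration);      // hypothetical per-superstep computation
DataSet<Tuple3<Integer, Integer, Double>> converged = iteration.closeWith(nextStep);

// this sink depends on the iteration result, so it implicitly waits for it
converged.writeAsCsv(outputPath);
env.execute("iterate, then write");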



wait until BulkIteration finishes

2016-03-31 Thread Lydia Ickler
Hi all,

is there a way to tell the program that it should wait until the BulkIteration 
finishes before the rest of the program is executed?

Best regards,
Lydia

BulkIteration and BroadcastVariables

2016-03-30 Thread Lydia Ickler
Hi all,
I have a question regarding the BulkIteration and BroadcastVariables:
The BulkIteration by default has one input variable and sends one variable into 
the next iteration, right?
What if I need to collect some intermediate results in each iteration? How 
would I do that?

For example in my code below I would like to store all newEigenValue. 
Unfortunately I didn’t find a way to do so.
Is it possible to set/change BroadcastVariables? Or is it only possible to 
„get“ them?

Thanks in advance!
Lydia


//read input file
DataSet<Tuple3<Integer, Integer, Double>> matrixA = readMatrix(env, input);


//initial:
//Approximate EigenVector by PowerIteration
DataSet<Tuple3<Integer, Integer, Double>> eigenVector = PowerIteration_getEigenVector2(matrixA);
//Approximate EigenValue by PowerIteration
DataSet<Tuple3<Integer, Integer, Double>> oldEigenValue = PowerIteration_getEigenValue(matrixA, eigenVector);
//Deflate original matrix
matrixA = PowerIteration_getNextMatrix(matrixA, eigenVector, oldEigenValue);

DataSet<Tuple3<Integer, Integer, Double>> newEigenVector = null;
DataSet<Tuple3<Integer, Integer, Double>> newEigenValue = null;
DataSet<Tuple3<Integer, Integer, Double>> newMatrixA = null;


//BulkIteration to find k dominant eigenvalues
IterativeDataSet<Tuple3<Integer, Integer, Double>> iteration = matrixA.iterate(outer_iterations);

newEigenVector = PowerIteration_getEigenVector2(iteration);
newEigenValue = PowerIteration_getEigenValue(iteration, newEigenVector);
newMatrixA = PowerIteration_getNextMatrix(iteration, newEigenVector, newEigenValue);

//get gap
DataSet<Tuple3<Integer, Integer, Double>> gap = newEigenValue.map(new getGap()).withBroadcastSet(oldEigenValue, "oldEigenValue");
DataSet<Tuple3<Integer, Integer, Double>> filtered = gap.filter(new gapFilter());
oldEigenValue = newEigenValue;

DataSet<Tuple3<Integer, Integer, Double>> neue = iteration.closeWith(newMatrixA, filtered);



for loop slow

2016-03-26 Thread Lydia Ickler
Hi,

I have an issue with a for-loop.
If I set the maximal iteration number i to more than 3 it gets stuck and I 
cannot figure out why.
With 1, 2 or 3 it runs smoothly.
I attached the code below and marked the loop with //PROBLEM.

Thanks in advance!
Lydia

package org.apache.flink.contrib.lifescience.examples;

import edu.princeton.cs.algs4.Graph;
import edu.princeton.cs.algs4.SymbolDigraph;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.FlatJoinFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.aggregation.Aggregations;
import org.apache.flink.api.java.io.CsvReader;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.operators.IterativeDataSet;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.contrib.lifescience.networks.algos.DataSetUtils;
import org.apache.flink.contrib.lifescience.networks.datatypes.networks.Network;
import 
org.apache.flink.contrib.lifescience.networks.datatypes.networks.NetworkEdge;
import 
org.apache.flink.contrib.lifescience.networks.datatypes.networks.NetworkNode;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.util.Collector;

import java.util.*;

import static edu.princeton.cs.algs4.GraphGenerator.simple;

public class PowerIteration {

//path to input
static String input = null;
//path to output
static String output = null;
//number of iterations (default = 7)
static int iterations = 7;
//threshold
static double delta = 0.01;

public void run() throws Exception {
ExecutionEnvironment env = 
ExecutionEnvironment.getExecutionEnvironment();

//read input file
DataSet<Tuple3<Integer, Integer, Double>> matrixA = readMatrix(env, input);

DataSet<Tuple3<Integer, Integer, Double>> eigenVector;
DataSet<Tuple3<Integer, Integer, Double>> eigenValue;

//initial:
//Approximate EigenVector by PowerIteration
eigenVector = PowerIteration_getEigenVector(matrixA);
//Approximate EigenValue by PowerIteration
eigenValue = PowerIteration_getEigenValue(matrixA, eigenVector);
//Deflate original matrix
matrixA = PowerIteration_getNextMatrix(matrixA, eigenVector, eigenValue);

MyResult initial = new MyResult(eigenVector,eigenValue,matrixA);

MyResult next = null;

//PROBLEM!!! get i eigenvalue gaps
for(int i=0;i<2;i++){
next = PowerIteration_routine(initial);
initial = next;
next.gap.print();
}

env.execute("Power Iteration");
}

public static DataSource<Tuple3<Integer, Integer, Double>> readMatrix(ExecutionEnvironment env,
        String filePath) {
    CsvReader csvReader = env.readCsvFile(filePath);
    csvReader.fieldDelimiter(",");
    csvReader.includeFields("ttt");
    return csvReader.types(Integer.class, Integer.class, Double.class);
}

public static final class ProjectJoinResultMapper implements
        MapFunction<Tuple2<Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>>,
                Tuple3<Integer, Integer, Double>> {
    @Override
    public Tuple3<Integer, Integer, Double> map(
            Tuple2<Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>> value)
            throws Exception {
        Integer row = value.f0.f0;
        Integer column = value.f1.f1;
        Double product = value.f0.f2 * value.f1.f2;
        return new Tuple3<Integer, Integer, Double>(row, column, product);
    }
}

public static final class RQ implements
        MapFunction<Tuple2<Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>>,
                Tuple3<Integer, Integer, Double>> {

    @Override
    public Tuple3<Integer, Integer, Double> map(
            Tuple2<Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>> value)
            throws Exception {

        return new Tuple3<Integer, Integer, Double>(value.f0.f0, value.f0.f1, value.f0.f2 / value.f1.f2);
    }
}

public static void main(String[] args) throws Exception {
    if (args.length < 2 || args.length > 4) {
        System.err.println("Usage: PowerIteration <input path> <output path> optional: <iterations> <delta>");
        System.exit(0);
    }

    input = args[0];
    output = args[1];

    if (args.length == 3) {
        iterations = Integer.parseInt(args[2]);
    }
    if (args.length == 4) {
        delta = Double.parseDouble(args[3]);
    }

    new PowerIteration().run();
}

public static final class deltaFilter implements
        FlatJoinFunction<Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>> {

    public void join(Tuple3<Integer, Integer, Double> candidate,
            Tuple3<Integer, Integer, Double> old, Collector<Tuple3<Integer, Integer, Double>> out) {

        if (!(candidate.f2 == old.f2)) {
            out.collect(candidate);
        }

        //if(Math.abs(candidate.f2-old.f2) > delta){
        //out.collect(candidate);
        //}

    }
}

public static final class normalizeByMax implements
        MapFunction<Tuple2<Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>>,
                Tuple3<Integer, Integer, Double>> {

    public Tuple3<Integer, Integer, Double> map(
            Tuple2<Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>> value)
            throws Exception {
        return new Tuple3<Integer, Integer, Double>(value.f0.f0, value.f0.f1, value.f0.f2 / value.f1.f2);
    }
}

Re: normalizing DataSet with cross()

2016-03-22 Thread Lydia Ickler
Sorry I was not clear: 
I meant the initial DataSet is changing. Not the ds. :)

  
> Am 22.03.2016 um 15:28 schrieb Till Rohrmann :
> 
> From the code extract I cannot tell what could be wrong because the code 
> looks ok. If ds changes, then your normalization result should change as 
> well, I would assume.
> 
> 
> On Tue, Mar 22, 2016 at 3:15 PM, Lydia Ickler  <mailto:ickle...@googlemail.com>> wrote:
> Hi Till,
> 
> maybe it is doing so because I rewrite the ds in the next step again and then 
> the working steps get mixed?
> I am reading the data from a local .csv file with readMatrix(env, „filename")
> 
> See code below.
> 
> Best regards,
> Lydia
> 
> //read input file
> DataSet> ds = readMatrix(env, input);
> 
> /
>  POWER ITERATION
>  */
> 
> //get initial vector - which equals matrixA * [1, ... , 1]
> DataSet> initial = 
> ds(0).aggregate(Aggregations.SUM,2);
> 
> //normalize by maximum value
> initial = initial.cross(initial.aggregate(Aggregations.MAX, 2)).map(new 
> normalizeByMax());
> public static DataSource> 
> readMatrix(ExecutionEnvironment env,
>   String 
> filePath) {
> CsvReader csvReader = env.readCsvFile(filePath);
> csvReader.fieldDelimiter(",");
> csvReader.includeFields("ttt");
> return csvReader.types(Integer.class, Integer.class, Double.class);
> }
> 
>> Am 22.03.2016 um 14:47 schrieb Till Rohrmann > <mailto:trohrm...@apache.org>>:
>> 
>> Hi Lydia,
>> 
>> I tried to reproduce your problem but I couldn't. Can it be that you have 
>> somewhere a non deterministic operation in your program or do you read the 
>> data from a source with varying data? Maybe you could send us a compilable 
>> and complete program which reproduces your problem.
>> 
>> Cheers,
>> Till
>> 
>> On Tue, Mar 22, 2016 at 2:21 PM, Lydia Ickler > <mailto:ickle...@googlemail.com>> wrote:
>> Hi all,
>> 
>> I have a question.
>> If I have a DataSet DataSet> ds and I want 
>> to normalize all values (at position 2) in it by the maximum of the DataSet 
>> (ds.aggregate(Aggregations.MAX, 2)). 
>> How do I tackle that?
>> 
>> If I use the cross operator my result changes every time I run the program 
>> (see code below)
>> Any suggestions?
>> 
>> Thanks in advance!
>> Lydia
>> ds.cross(ds.aggregate(Aggregations.MAX, 2)).map(new normalizeByMax());
>> public static final class normalizeByMax implements
>> MapFunction, Tuple3> Integer, Double>>,
>> Tuple3> {
>> 
>> public Tuple3 map(
>> Tuple2, Tuple3> Integer, Double>> value)
>> throws Exception {
>> return new Tuple3> Double>(value.f0.f0,value.f0.f1,value.f0.f2/value.f1.f2);
>> }
>> }
>> 
>> 
>> 
> 
> 



Re: normalizing DataSet with cross()

2016-03-22 Thread Lydia Ickler
Hi Till,

maybe it is doing so because I rewrite the ds in the next step again and then 
the working steps get mixed?
I am reading the data from a local .csv file with readMatrix(env, „filename")

See code below.

Best regards,
Lydia

//read input file
DataSet<Tuple3<Integer, Integer, Double>> ds = readMatrix(env, input);

/*
 POWER ITERATION
 */

//get initial vector - which equals matrixA * [1, ... , 1]
DataSet<Tuple3<Integer, Integer, Double>> initial = ds.groupBy(0).aggregate(Aggregations.SUM, 2);

//normalize by maximum value
initial = initial.cross(initial.aggregate(Aggregations.MAX, 2)).map(new normalizeByMax());

public static DataSource<Tuple3<Integer, Integer, Double>> readMatrix(ExecutionEnvironment env,
        String filePath) {
    CsvReader csvReader = env.readCsvFile(filePath);
    csvReader.fieldDelimiter(",");
    csvReader.includeFields("ttt");
    return csvReader.types(Integer.class, Integer.class, Double.class);
}

> Am 22.03.2016 um 14:47 schrieb Till Rohrmann :
> 
> Hi Lydia,
> 
> I tried to reproduce your problem but I couldn't. Can it be that you have 
> somewhere a non deterministic operation in your program or do you read the 
> data from a source with varying data? Maybe you could send us a compilable 
> and complete program which reproduces your problem.
> 
> Cheers,
> Till
> 
> On Tue, Mar 22, 2016 at 2:21 PM, Lydia Ickler  <mailto:ickle...@googlemail.com>> wrote:
> Hi all,
> 
> I have a question.
> If I have a DataSet DataSet> ds and I want 
> to normalize all values (at position 2) in it by the maximum of the DataSet 
> (ds.aggregate(Aggregations.MAX, 2)). 
> How do I tackle that?
> 
> If I use the cross operator my result changes every time I run the program 
> (see code below)
> Any suggestions?
> 
> Thanks in advance!
> Lydia
> ds.cross(ds.aggregate(Aggregations.MAX, 2)).map(new normalizeByMax());
> public static final class normalizeByMax implements
> MapFunction, Tuple3 Integer, Double>>,
> Tuple3> {
> 
> public Tuple3 map(
> Tuple2, Tuple3 Double>> value)
> throws Exception {
> return new Tuple3 Double>(value.f0.f0,value.f0.f1,value.f0.f2/value.f1.f2);
> }
> }
> 
> 
> 



normalizing DataSet with cross()

2016-03-22 Thread Lydia Ickler
Hi all,

I have a question.
If I have a DataSet<Tuple3<Integer, Integer, Double>> ds and I want to normalize all values (at position 2) in it by the maximum of the DataSet (ds.aggregate(Aggregations.MAX, 2)).
How do I tackle that?

If I use the cross operator my result changes every time I run the program (see 
code below)
Any suggestions?

Thanks in advance!
Lydia
ds.cross(ds.aggregate(Aggregations.MAX, 2)).map(new normalizeByMax());

public static final class normalizeByMax implements
        MapFunction<Tuple2<Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>>,
                Tuple3<Integer, Integer, Double>> {

    public Tuple3<Integer, Integer, Double> map(
            Tuple2<Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>> value)
            throws Exception {
        return new Tuple3<Integer, Integer, Double>(value.f0.f0, value.f0.f1, value.f0.f2 / value.f1.f2);
    }
}




Help with DeltaIteration

2016-03-19 Thread Lydia Ickler
Hi,
I have a question regarding the Delta Iteration. 
I basically want to iterate as long as the former and the new calculated set 
are different. Stop if they are the same.

Right now I get a result set that has entries with duplicate „row“ indices 
which should not be the case.
I guess I am doing something wrong in the iteration.closeWith(intermediate, 
diffs); Maybe I am sending only parts of the set but for the Multiplication 
(ProjectJoinResultMapper()) I need the whole DataSet.
Could somebody please hint me in the right direction?

Thanks in advance!

This is what I have right now:
DataSet<Tuple3<Integer, Integer, Double>> initial = matrixA.groupBy(0).sum(2);

//normalize by maximum value
initial = initial.cross(initial.sum(2)).map(new normalizeByMax());
DeltaIteration<Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>> iteration =
        initial.iterateDelta(initial, 1, 0, 1);

DataSet<Tuple3<Integer, Integer, Double>> intermediate =
        matrixA.join(iteration.getWorkset()).where(1).equalTo(0)
                .map(new ProjectJoinResultMapper()).groupBy(0, 1).sum(2).groupBy(0).sum(2)
                .cross(matrixA.join(iteration.getWorkset()).where(1).equalTo(0)
                        .map(new ProjectJoinResultMapper()).groupBy(0, 1).sum(2).groupBy(0).sum(2).max(2))
                .map(new normalizeByMax());
DataSet<Tuple3<Integer, Integer, Double>> diffs =
        intermediate.join(iteration.getSolutionSet()).where(0, 1).equalTo(0, 1).with(new ComponentIdFilter());
DataSet<Tuple3<Integer, Integer, Double>> result =
        iteration.closeWith(intermediate, diffs);

public static final class ComponentIdFilter implements
        FlatJoinFunction<Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>> {

    public void join(Tuple3<Integer, Integer, Double> candidate,
            Tuple3<Integer, Integer, Double> old, Collector<Tuple3<Integer, Integer, Double>> out) {

        if (!candidate.f2.equals(old.f2)) {
            out.collect(candidate);
        }
    }
}



MatrixMultiplication

2016-03-14 Thread Lydia Ickler
Hi, 

I wrote to you before about the MatrixMultiplication in Flink… Unfortunately, the multiplication of a pair of 1000 x 1000 matrices is already taking almost a minute.
Would you please take a look at my attached code. Maybe you can suggest 
something to make it faster?
Or would it be better to tackle the problem with the Gelly API? (Since the 
matrix is an adjacency matrix). And if so how would you tackle it?

Thanks in advance and best regards, 
Lydia

package de.tuberlin.dima.aim3.assignment3;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.CsvReader;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.operators.GroupReduceOperator;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.DataSet;


public class MatrixMultiplication {

   static String input = null;
   static String output = null;

   public void run() throws Exception {
  ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

      DataSet<Tuple3<Integer, Integer, Double>> matrixA = readMatrix(env, input);

      matrixA.join(matrixA).where(1).equalTo(0)
            .map(new ProjectJoinResultMapper()).groupBy(0, 1).sum(2).writeAsCsv(output);

      env.execute();
   }

   public static DataSource<Tuple3<Integer, Integer, Double>> readMatrix(ExecutionEnvironment env,
         String filePath) {
      CsvReader csvReader = env.readCsvFile(filePath);
      csvReader.fieldDelimiter(',');
      csvReader.includeFields("fttt");
      return csvReader.types(Integer.class, Integer.class, Double.class);
   }

   public static final class ProjectJoinResultMapper implements
         MapFunction<Tuple2<Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>>,
               Tuple3<Integer, Integer, Double>> {
      @Override
      public Tuple3<Integer, Integer, Double> map(
            Tuple2<Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>> value)
            throws Exception {
         Integer row = value.f0.f0;
         Integer column = value.f1.f1;
         Double product = value.f0.f2 * value.f1.f2;
         return new Tuple3<Integer, Integer, Double>(row, column, product);
      }
   }

   public static void main(String[] args) throws Exception {
      if (args.length < 2) {
         System.err.println("Usage: MatrixMultiplication <input path> <output path>");
         System.exit(0);
      }
      input = args[0];
      output = args[1];
      new MatrixMultiplication().run();
   }

}



filter dataset

2016-02-29 Thread Lydia Ickler
Hi all, 

I have a DataSet of tuples and I want to apply a filter to only get back the entries whose first Integer field equals e.g. 0.
With a normal filter I do not have the possibility to pass an additional argument; I have to hard-code that parameter inside the filter function.
Is there a possibility to pass a parameter to make the filter more flexible?
What would be the smartest way to do so?

Best regards, 
Lydia
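
One simple way to do what is asked above is to hand the parameter to the FilterFunction through its constructor; the Tuple3<Integer, Integer, Double> element type is an assumption taken from the other threads:

public static class FirstFieldFilter implements FilterFunction<Tuple3<Integer, Integer, Double>> {

    private final int wanted;

    public FirstFieldFilter(int wanted) {
        this.wanted = wanted;
    }

    @Override
    public boolean filter(Tuple3<Integer, Integer, Double> value) {
        return value.f0 == wanted;
    }
}

// usage: keep only the entries whose first Integer equals 0
DataSet<Tuple3<Integer, Integer, Double>> zeroRows = ds.filter(new FirstFieldFilter(0));

Alternatively, a RichFilterFunction can read values passed in via .withParameters(new Configuration()) in its open() method.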

DistributedMatrix in Flink

2016-02-04 Thread Lydia Ickler
Hi all,

as mentioned before I am trying to import the RowMatrix from Spark to Flink…

In the code I already ran into a dead end… In the function multiplyGramianMatrixBy() (see end of mail) there is the line rows.context.broadcast(v) (rows is a DataSet[Vector]).
What exactly is this line doing? Does it fill the „content“ of v into the variable rows?
And another question:
What is the function treeAggregate doing, and how would you tackle a „copy“ of that in Flink?

Thanks in advance!
Best regards, 
Lydia


private[flink] def multiplyGramianMatrixBy(v: DenseVector[Double]): 
DenseVector[Double] = {
  val n = numCols().toInt

  val vbr = rows.context.broadcast(v)

  rows.treeAggregate(BDV.zeros[Double](n))(
seqOp = (U, r) => {
  val rBrz = r.toBreeze
  val a = rBrz.dot(vbr.data)
  rBrz match {
// use specialized axpy for better performance
case _: BDV[_] => brzAxpy(a, rBrz.asInstanceOf[BDV[Double]], U)
case _: BSV[_] => brzAxpy(a, rBrz.asInstanceOf[BSV[Double]], U)
case _ => throw new UnsupportedOperationException(
  s"Do not support vector operation from type 
${rBrz.getClass.getName}.")
  }
  U
}, combOp = (U1, U2) => U1 += U2)
}



Re: cluster execution

2016-02-01 Thread Lydia Ickler
xD…  a simple "hdfs dfs -chmod -R 777 /users" fixed it!


> Am 01.02.2016 um 12:17 schrieb Till Rohrmann :
> 
> Hi Lydia,
> 
> I looks like that. I guess you should check your hdfs access rights. 
> 
> Cheers,
> Till
> 
> On Mon, Feb 1, 2016 at 11:28 AM, Lydia Ickler  <mailto:ickle...@googlemail.com>> wrote:
> Hi Till,
> 
> thanks for your reply!
> I tested it with the Wordcount example.
> Everything works fine if I run the command:
> ./flink run -p 3 /home/flink/examples/WordCount.jar
> Then the program gets executed by my 3 workers. 
> If I want to save the output to a file:
> ./flink run -p 3 /home/flink/examples/WordCount.jar 
> hdfs://grips2:9000/users/Flink_1000.csv hdfs://grips2:9000/users/Wordcount_1000
> 
> I get the following error message:
> What am I doing wrong? Is something wrong with my cluster writing permissions?
> 
> org.apache.flink.client.program.ProgramInvocationException: The program 
> execution failed: Cannot initialize task 'DataSink (CsvOutputFormat (path: 
> hdfs://grips2:9000/users/Wordcount_1000s <>, delimiter:  ))': Output 
> directory could not be created.
>   at org.apache.flink.client.program.Client.runBlocking(Client.java:370)
>   at org.apache.flink.client.program.Client.runBlocking(Client.java:348)
>   at org.apache.flink.client.program.Client.runBlocking(Client.java:315)
>   at 
> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:70)
>   at 
> org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:78)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:497)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:395)
>   at org.apache.flink.client.program.Client.runBlocking(Client.java:252)
>   at 
> org.apache.flink.client.CliFrontend.executeProgramBlocking(CliFrontend.java:676)
>   at org.apache.flink.client.CliFrontend.run(CliFrontend.java:326)
>   at 
> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:978)
>   at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1028)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Cannot 
> initialize task 'DataSink (CsvOutputFormat (path: 
> hdfs://grips2:9000/users/Wordcount_1000s <>, delimiter:  ))': Output 
> directory could not be created.
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$6.apply(JobManager.scala:867)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$6.apply(JobManager.scala:851)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.flink.runtime.jobmanager.JobManager.org 
> <http://org.apache.flink.runtime.jobmanager.jobmanager.org/>$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:851)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:341)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>   at 
> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:36)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>   at 
> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>   at 
> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>   at scala.PartialFunction$class.applyOrElse(PartialFunction

Re: cluster execution

2016-02-01 Thread Lydia Ickler
rkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.io.IOException: Output directory could not be created.
at 
org.apache.flink.api.common.io.FileOutputFormat.initializeGlobal(FileOutputFormat.java:295)
at 
org.apache.flink.runtime.jobgraph.OutputFormatVertex.initializeOnMaster(OutputFormatVertex.java:84)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$6.apply(JobManager.scala:863)
... 29 more

The exception above occurred while trying to run your command.


> Am 28.01.2016 um 10:44 schrieb Till Rohrmann :
> 
> Hi Lydia,
> 
> what do you mean with master? Usually when you submit a program to the 
> cluster and don’t specify the parallelism in your program, then it will be 
> executed with the parallelism.default value as parallelism. You can specify 
> the value in your cluster configuration flink-config.yaml file. Alternatively 
> you can always specify the parallelism via the CLI client with the -p option.
> 
> Cheers,
> Till
> 
> 
> On Thu, Jan 28, 2016 at 9:53 AM, Lydia Ickler  <mailto:ickle...@googlemail.com>> wrote:
> Hi all,
> 
> I am doing some operations on a DataSet> … 
> (see code below)
> When I run my program on a cluster with 3 machines I can see within the web 
> client that only my master is executing the program. 
> Do I have to specify somewhere that all machines have to participate? Usually 
> the cluster executes in parallel.
> 
> Any suggestions?
> 
> Best regards, 
> Lydia
> DataSet> matrixA = readMatrix(env, input);
> DataSet> initial = matrixA.groupBy(0).sum(2);
> 
> //normalize by maximum value
> initial = initial.cross(initial.max(2)).map(new normalizeByMax());
> matrixA.join(initial).where(1).equalTo(0)
>   .map(new ProjectJoinResultMapper()).groupBy(0, 1).sum(2);
> 
> 
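
For reference, the two knobs Till mentions above look like this (the values are examples):

# conf/flink-conf.yaml on the cluster: used when the job itself sets no parallelism
parallelism.default: 3

# or per submission, via the CLI client
./flink run -p 3 /home/flink/examples/WordCount.jar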



cluster execution

2016-01-28 Thread Lydia Ickler
Hi all,

I am doing some operations on a DataSet<Tuple3<Integer, Integer, Double>> … (see code below)
When I run my program on a cluster with 3 machines I can see in the web client that only my master is executing the program.
Do I have to specify somewhere that all machines have to participate? Usually the cluster executes in parallel.

Any suggestions?

Best regards, 
Lydia
DataSet<Tuple3<Integer, Integer, Double>> matrixA = readMatrix(env, input);
DataSet<Tuple3<Integer, Integer, Double>> initial = matrixA.groupBy(0).sum(2);

//normalize by maximum value
initial = initial.cross(initial.max(2)).map(new normalizeByMax());
matrixA.join(initial).where(1).equalTo(0)
      .map(new ProjectJoinResultMapper()).groupBy(0, 1).sum(2);



Re: rowmatrix equivalent

2016-01-26 Thread Lydia Ickler
Hi Till,

maybe I will do that :) 
If I have some other questions I will let you know!

Best regards,
Lydia


> Am 24.01.2016 um 17:33 schrieb Till Rohrmann :
> 
> Hi Lydia,
> 
> Flink does not come with a distributed matrix implementation as Spark does it 
> with the RowMatrix, yet. However, you can easily implement it yourself. This 
> would also be a good contribution to the project if you want to tackle the 
> problem
> 
> Cheers,
> Till
> 
> On Sun, Jan 24, 2016 at 4:03 PM, Lydia Ickler  <mailto:ickle...@googlemail.com>> wrote:
> Hi all,
> 
> this is maybe a stupid question but what within Flink is the equivalent to 
> Sparks’ RowMatrix ?
> 
> Thanks in advance,
> Lydia
> 



Re: MatrixMultiplication

2016-01-25 Thread Lydia Ickler
Hi Till,

thanks for your reply :)
Yes, it finished after ~27 minutes…

Best regards, 
Lydia
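
For scale, the O(n^3) argument quoted below means that going from a 100 x 100 to a 1000 x 1000 matrix multiplies the work by (1000/100)^3 = 1000.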

> Am 25.01.2016 um 14:27 schrieb Till Rohrmann :
> 
> Hi Lydia,
> 
> Since matrix multiplication is O(n^3), I would assume that it would simply 
> take 1000 times longer than the multiplication of the 100 x 100 matrix. Have 
> you waited so long to see whether it completes or is there another problem?
> 
> Cheers,
> Till
> 
> On Mon, Jan 25, 2016 at 2:13 PM, Lydia Ickler  <mailto:ickle...@googlemail.com>> wrote:
> Hi, 
> 
> I want do a simple MatrixMultiplication and use the following code (see 
> bottom).
> For matrices 50x50 or 100x100 it is no problem. But already with matrices of 
> 1000x1000 it would not work anymore and gets stuck in the joining part. 
> What am I doing wrong?
> 
> Best regards, 
> Lydia
> 
> package de.tuberlin.dima.aim3.assignment3;
> 
> import org.apache.flink.api.common.functions.MapFunction;
> import org.apache.flink.api.java.ExecutionEnvironment;
> import org.apache.flink.api.java.io.CsvReader;
> import org.apache.flink.api.java.operators.DataSource;
> import org.apache.flink.api.java.operators.GroupReduceOperator;
> import org.apache.flink.api.java.tuple.Tuple2;
> import org.apache.flink.api.java.tuple.Tuple3;
> import org.apache.flink.api.java.DataSet;
> 
> 
> public class MatrixMultiplication {
> 
>static String input = null;
>static String output = null;
> 
>public void run() throws Exception {
>   ExecutionEnvironment env = 
> ExecutionEnvironment.getExecutionEnvironment();
> 
>   DataSet> matrixA = readMatrix(env, 
> input);
> 
>   matrixA.join(matrixA).where(1).equalTo(0)
> .map(new ProjectJoinResultMapper()).groupBy(0, 
> 1).sum(2).writeAsCsv(output);
> 
>  
>   env.execute();
>}
> 
> 
> 
>public static DataSource> 
> readMatrix(ExecutionEnvironment env,
>  String filePath) {
>   CsvReader csvReader = env.readCsvFile(filePath);
>   csvReader.fieldDelimiter(',');
>   csvReader.includeFields("fttt");
>   return csvReader.types(Integer.class, Integer.class, Double.class);
>}
> 
>public static final class ProjectJoinResultMapper implements
> MapFunction,
>Tuple3>,
>   Tuple3> {
>   @Override
>   public Tuple3 map(
> Tuple2, Tuple3 Double>> value)
> throws Exception {
>  Integer row = value.f0.f0;
>  Integer column = value.f1.f1;
>  Double product = value.f0.f2 * value.f1.f2;
>  return new Tuple3(row, column, product);
>   }
>}
> 
>   
>public static void main(String[] args) throws Exception {
>   if(args.length<2){
>  System.err.println("Usage: MatrixMultiplication   path>");
>  System.exit(0);
>   }
>   input = args[0];
>   output = args[1];
>   new MatrixMultiplication().run();
>}
> 
> }
> 
> 



MatrixMultiplication

2016-01-25 Thread Lydia Ickler
Hi, 

I want to do a simple MatrixMultiplication and use the following code (see bottom).
For matrices of 50x50 or 100x100 it is no problem. But already with matrices of 1000x1000 it does not work anymore and gets stuck in the joining part.
What am I doing wrong?

Best regards, 
Lydia

package de.tuberlin.dima.aim3.assignment3;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.CsvReader;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.operators.GroupReduceOperator;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.DataSet;


public class MatrixMultiplication {

   static String input = null;
   static String output = null;

   public void run() throws Exception {
  ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

      DataSet<Tuple3<Integer, Integer, Double>> matrixA = readMatrix(env, input);

      matrixA.join(matrixA).where(1).equalTo(0)
            .map(new ProjectJoinResultMapper()).groupBy(0, 1).sum(2).writeAsCsv(output);

      env.execute();
   }

   public static DataSource<Tuple3<Integer, Integer, Double>> readMatrix(ExecutionEnvironment env,
         String filePath) {
      CsvReader csvReader = env.readCsvFile(filePath);
      csvReader.fieldDelimiter(',');
      csvReader.includeFields("fttt");
      return csvReader.types(Integer.class, Integer.class, Double.class);
   }

   public static final class ProjectJoinResultMapper implements
         MapFunction<Tuple2<Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>>,
               Tuple3<Integer, Integer, Double>> {
      @Override
      public Tuple3<Integer, Integer, Double> map(
            Tuple2<Tuple3<Integer, Integer, Double>, Tuple3<Integer, Integer, Double>> value)
            throws Exception {
         Integer row = value.f0.f0;
         Integer column = value.f1.f1;
         Double product = value.f0.f2 * value.f1.f2;
         return new Tuple3<Integer, Integer, Double>(row, column, product);
      }
   }

   public static void main(String[] args) throws Exception {
      if (args.length < 2) {
         System.err.println("Usage: MatrixMultiplication <input path> <output path>");
         System.exit(0);
      }
      input = args[0];
      output = args[1];
      new MatrixMultiplication().run();
   }

}



rowmatrix equivalent

2016-01-24 Thread Lydia Ickler
Hi all,

this is maybe a stupid question but what within Flink is the equivalent to 
Sparks’ RowMatrix ?

Thanks in advance, 
Lydia

Re: eigenvalue solver

2016-01-12 Thread Lydia Ickler
Hi Till, 
Thanks for the paper link!
Do you maybe have a code snippet in mind, from BLAS, Breeze or Spark, to start from?
Best regards, 
Lydia 


Von meinem iPhone gesendet
> Am 12.01.2016 um 10:46 schrieb Till Rohrmann :
> 
> Hi Lydia,
> 
> there is no Eigenvalue solver included in FlinkML yet. But if you want to, 
> then you can give it a try :-)
> 
> [1] http://www.cs.newpaltz.edu/~lik/publications/Ruixuan-Li-CCPE-2015.pdf
> 
> Cheers,
> Till
> 
>> On Tue, Jan 12, 2016 at 9:47 AM, Lydia Ickler  
>> wrote:
>> Hi,
>> 
>> I wanted to know if there are any implementations yet within the Machine 
>> Learning Library or generally that can efficiently solve eigenvalue problems 
>> in Flink?
>> Or if not do you have suggestions on how to approach a parallel execution 
>> maybe with BLAS or Breeze?
>> 
>> Thanks in advance!
>> Lydia
> 


eigenvalue solver

2016-01-12 Thread Lydia Ickler
Hi,

I wanted to know if there are any implementations yet within the Machine 
Learning Library or generally that can efficiently solve eigenvalue problems in 
Flink? 
Or if not do you have suggestions on how to approach a parallel execution maybe 
with BLAS or Breeze?

Thanks in advance!
Lydia

Re: writeAsCsv

2015-10-07 Thread Lydia Ickler
ok, thanks! :)

I will try that!



> Am 07.10.2015 um 21:35 schrieb Lydia Ickler :
> 
> Hi, 
> 
> stupid question: Why is this not saved to file?
> I want to transform an array to a DataSet but the Graph stops at collect().
> 
> //Transform Spectrum to DataSet
> List> dataList = new LinkedList String>>();
> double[][] arr = filteredSpectrum.getAs2DDoubleArray();
> for (int i=0;i dataList.add(new Tuple2( arr[0][i], arr[1][i]));
> }
> env.fromCollection(dataList).writeAsCsv(output);
> Best regards, 
> Lydia
> 



writeAsCsv

2015-10-07 Thread Lydia Ickler
Hi, 

stupid question: Why is this not saved to file?
I want to transform an array to a DataSet but the Graph stops at collect().

//Transform Spectrum to DataSet
List<Tuple2<Double, Double>> dataList = new LinkedList<Tuple2<Double, Double>>();
double[][] arr = filteredSpectrum.getAs2DDoubleArray();
for (int i = 0; i < arr[0].length; i++) {
    dataList.add(new Tuple2<Double, Double>(arr[0][i], arr[1][i]));
}
env.fromCollection(dataList).writeAsCsv(output);

source binary file

2015-10-06 Thread Lydia Ickler
Hi,

how would I read a BinaryFile from HDFS with the Flink Java API?

I can only find the Scala way…

All the best,
Lydia

Re: data flow example on cluster

2015-10-02 Thread Lydia Ickler
Thanks, Till!
I used the ALS from FlinkML and it works :)

Best regards,
Lydia

> Am 02.10.2015 um 14:14 schrieb Till Rohrmann :
> 
> Hi Lydia,
> 
> I think the APIs of the versions 0.8-incubating-SNAPSHOT and 0.10-SNAPSHOT 
> are not compatible. Thus, it’s not just simply setting the dependencies to 
> 0.10-SNAPSHOT. You also have to fix the API changes. This might not be 
> trivial. Therefore, I’d recommend you to simply use the ALS implementation 
> which you can find in FlinkML.
> 
> Cheers,
> Till
> 
> 
> On Fri, Oct 2, 2015 at 2:11 PM, Robert Metzger  <mailto:rmetz...@apache.org>> wrote:
> Lydia,
> can you check the log of Flink installed on the cluster? During startup, it 
> is writing the exact commit your 0.10-SNAPSHOT is based on.
> 
> I would recommend to check out exactly that commit locally and then build 
> Flink locally. After that, you can rebuild your jobs jar again.
> With that method, there is certainly no version mismatch.
> 
> (Thats one of the reasons why using SNAPSHOT versions on such environments is 
> not recommended)
> 
> 
> 
> 
> On Fri, Oct 2, 2015 at 2:05 PM, Stefano Bortoli  <mailto:s.bort...@gmail.com>> wrote:
> I had problems running a flink job with maven, probably there is some issue 
> of classloading. For me worked to run a simple java command with the uberjar. 
> So I build the jar using maven, and then run it this way
> 
> java -Xmx2g -cp target/youruberjar.jar yourclass arg1 arg2
> 
> hope it helps,
> Stefano
> 
> 2015-10-02 12:21 GMT+02:00 Lydia Ickler  <mailto:ickle...@googlemail.com>>:
> Hi,
> 
> I did not create anything by myself.
> I just downloaded the files from here: 
> https://github.com/tillrohrmann/flink-perf 
> <https://github.com/tillrohrmann/flink-perf>
> 
> And then executed mvn clean install -DskipTests
> 
> Then I opened the project within IntelliJ and there it works fine.
> Then I exported it to the cluster that runs with 0.10-SNAPSHOT.
> 
> 
>> Am 02.10.2015 um 12:15 schrieb Stephan Ewen > <mailto:se...@apache.org>>:
>> 
>> @Lydia  Did you create your POM files for your job with an 0.8.x quickstart?
>> 
>> Can you try to simply re-create your project's POM files with a new 
>> quickstart?
>> 
>> I think that the POMS between 0.8-incubating-SNAPSHOT and 0.10-SNAPSHOT may 
>> not be quite compatible any more...
>> 
>> On Fri, Oct 2, 2015 at 12:07 PM, Robert Metzger > <mailto:rmetz...@apache.org>> wrote:
>> Are you relying on a feature only available in 0.10-SNAPSHOT?
>> Otherwise, I would recommend to use the latest stable release (0.9.1) for 
>> your flink job and on the cluster.
>> 
>> On Fri, Oct 2, 2015 at 11:55 AM, Lydia Ickler > <mailto:ickle...@googlemail.com>> wrote:
>> Hi,
>> 
>> but inside the pom of flunk-job is the flink version set to 0.8
>> 
>>  0.8-incubating-SNAPSHOT
>> 
>> how can I change it to the newest?
>>  0.10-SNAPSHOT
>> Is not working
>> 
>>> Am 02.10.2015 um 11:48 schrieb Robert Metzger >> <mailto:rmetz...@apache.org>>:
>>> 
>>> I think there is a version mismatch between the Flink version you've used 
>>> to compile your job and the Flink version installed on the cluster.
>>> 
>>> Maven automagically pulls newer 0.10-SNAPSHOT versions every time you're 
>>> building your job.
>>> 
>>> On Fri, Oct 2, 2015 at 11:45 AM, Lydia Ickler >> <mailto:ickle...@googlemail.com>> wrote:
>>> Hi Till,
>>> I want to execute your Matrix Completion program „ALSJoin“.
>>> 
>>> Locally it works perfect.
>>> Now I want to execute it on the cluster with:
>>> 
>>> run -c com.github.projectflink.als.ALSJoin -cp 
>>> /tmp/icklerly/flink-jobs-0.1-SNAPSHOT.jar 0 2 0.001 10 1 1
>>> 
>>> but I get the following error:
>>> java.lang.NoSuchMethodError: 
>>> org.apache.flink.api.scala.typeutils.CaseClassTypeInfo.(Ljava/lang/Class;Lscala/collection/Seq;Lscala/collection/Seq;)V
>>> 
>>> I guess something like the flink-scala-0.10-SNAPSHOT.jar is missing.
>>> How can I add that to the path?
>>> 
>>> Best regards,
>>> Lydia
>>> 
>>> 
>> 
>> 
>> 
> 
> 
> 
> 



Re: data flow example on cluster

2015-10-02 Thread Lydia Ickler
Hi,

I did not create anything by myself.
I just downloaded the files from here:
https://github.com/tillrohrmann/flink-perf

And then executed mvn clean install -DskipTests

Then I opened the project within IntelliJ and there it works fine.
Then I exported it to the cluster that runs with 0.10-SNAPSHOT.


> Am 02.10.2015 um 12:15 schrieb Stephan Ewen :
> 
> @Lydia  Did you create your POM files for your job with an 0.8.x quickstart?
> 
> Can you try to simply re-create your project's POM files with a new 
> quickstart?
> 
> I think that the POMS between 0.8-incubating-SNAPSHOT and 0.10-SNAPSHOT may 
> not be quite compatible any more...
> 
> On Fri, Oct 2, 2015 at 12:07 PM, Robert Metzger  <mailto:rmetz...@apache.org>> wrote:
> Are you relying on a feature only available in 0.10-SNAPSHOT?
> Otherwise, I would recommend to use the latest stable release (0.9.1) for 
> your flink job and on the cluster.
> 
> On Fri, Oct 2, 2015 at 11:55 AM, Lydia Ickler  <mailto:ickle...@googlemail.com>> wrote:
> Hi,
> 
> but inside the pom of flunk-job is the flink version set to 0.8
> 
>   0.8-incubating-SNAPSHOT
> 
> how can I change it to the newest?
>   0.10-SNAPSHOT
> Is not working
> 
>> Am 02.10.2015 um 11:48 schrieb Robert Metzger > <mailto:rmetz...@apache.org>>:
>> 
>> I think there is a version mismatch between the Flink version you've used to 
>> compile your job and the Flink version installed on the cluster.
>> 
>> Maven automagically pulls newer 0.10-SNAPSHOT versions every time you're 
>> building your job.
>> 
>> On Fri, Oct 2, 2015 at 11:45 AM, Lydia Ickler > <mailto:ickle...@googlemail.com>> wrote:
>> Hi Till,
>> I want to execute your Matrix Completion program „ALSJoin“.
>> 
>> Locally it works perfect.
>> Now I want to execute it on the cluster with:
>> 
>> run -c com.github.projectflink.als.ALSJoin -cp 
>> /tmp/icklerly/flink-jobs-0.1-SNAPSHOT.jar 0 2 0.001 10 1 1
>> 
>> but I get the following error:
>> java.lang.NoSuchMethodError: 
>> org.apache.flink.api.scala.typeutils.CaseClassTypeInfo.(Ljava/lang/Class;Lscala/collection/Seq;Lscala/collection/Seq;)V
>> 
>> I guess something like the flink-scala-0.10-SNAPSHOT.jar is missing.
>> How can I add that to the path?
>> 
>> Best regards,
>> Lydia
>> 
>> 
> 
> 
> 



Re: data flow example on cluster

2015-10-02 Thread Lydia Ickler

> Am 02.10.2015 um 11:55 schrieb Lydia Ickler :
> 
> 0.10-SNAPSHOT



Re: data flow example on cluster

2015-10-02 Thread Lydia Ickler
It is a university cluster which we have to use.
So I am forced to use it :(

How can I bypass that version conflict?


> Am 02.10.2015 um 12:07 schrieb Robert Metzger :
> 
> Are you relying on a feature only available in 0.10-SNAPSHOT?
> Otherwise, I would recommend to use the latest stable release (0.9.1) for 
> your flink job and on the cluster.
> 
> On Fri, Oct 2, 2015 at 11:55 AM, Lydia Ickler  <mailto:ickle...@googlemail.com>> wrote:
> Hi,
> 
> but inside the pom of flunk-job is the flink version set to 0.8
> 
>   0.8-incubating-SNAPSHOT
> 
> how can I change it to the newest?
>   0.10-SNAPSHOT
> Is not working
> 
>> Am 02.10.2015 um 11:48 schrieb Robert Metzger > <mailto:rmetz...@apache.org>>:
>> 
>> I think there is a version mismatch between the Flink version you've used to 
>> compile your job and the Flink version installed on the cluster.
>> 
>> Maven automagically pulls newer 0.10-SNAPSHOT versions every time you're 
>> building your job.
>> 
>> On Fri, Oct 2, 2015 at 11:45 AM, Lydia Ickler > <mailto:ickle...@googlemail.com>> wrote:
>> Hi Till,
>> I want to execute your Matrix Completion program „ALSJoin“.
>> 
>> Locally it works perfect.
>> Now I want to execute it on the cluster with:
>> 
>> run -c com.github.projectflink.als.ALSJoin -cp 
>> /tmp/icklerly/flink-jobs-0.1-SNAPSHOT.jar 0 2 0.001 10 1 1
>> 
>> but I get the following error:
>> java.lang.NoSuchMethodError: 
>> org.apache.flink.api.scala.typeutils.CaseClassTypeInfo.(Ljava/lang/Class;Lscala/collection/Seq;Lscala/collection/Seq;)V
>> 
>> I guess something like the flink-scala-0.10-SNAPSHOT.jar is missing.
>> How can I add that to the path?
>> 
>> Best regards,
>> Lydia
>> 
>> 
> 
> 



Re: data flow example on cluster

2015-10-02 Thread Lydia Ickler
Hi,

but inside the pom of flink-jobs the Flink version is set to 0.8:

0.8-incubating-SNAPSHOT

How can I change it to the newest?
Setting it to

0.10-SNAPSHOT

is not working.

> Am 02.10.2015 um 11:48 schrieb Robert Metzger :
> 
> I think there is a version mismatch between the Flink version you've used to 
> compile your job and the Flink version installed on the cluster.
> 
> Maven automagically pulls newer 0.10-SNAPSHOT versions every time you're 
> building your job.
> 
> On Fri, Oct 2, 2015 at 11:45 AM, Lydia Ickler  <mailto:ickle...@googlemail.com>> wrote:
> Hi Till,
> I want to execute your Matrix Completion program „ALSJoin“.
> 
> Locally it works perfect.
> Now I want to execute it on the cluster with:
> 
> run -c com.github.projectflink.als.ALSJoin -cp 
> /tmp/icklerly/flink-jobs-0.1-SNAPSHOT.jar 0 2 0.001 10 1 1
> 
> but I get the following error:
> java.lang.NoSuchMethodError: 
> org.apache.flink.api.scala.typeutils.CaseClassTypeInfo.(Ljava/lang/Class;Lscala/collection/Seq;Lscala/collection/Seq;)V
> 
> I guess something like the flink-scala-0.10-SNAPSHOT.jar is missing.
> How can I add that to the path?
> 
> Best regards,
> Lydia
> 
> 



Re: data flow example on cluster

2015-10-02 Thread Lydia Ickler
Hi Till,
I want to execute your Matrix Completion program „ALSJoin“.

Locally it works perfect.
Now I want to execute it on the cluster with:

run -c com.github.projectflink.als.ALSJoin -cp 
/tmp/icklerly/flink-jobs-0.1-SNAPSHOT.jar 0 2 0.001 10 1 1

but I get the following error:
java.lang.NoSuchMethodError: 
org.apache.flink.api.scala.typeutils.CaseClassTypeInfo.<init>(Ljava/lang/Class;Lscala/collection/Seq;Lscala/collection/Seq;)V

I guess something like the flink-scala-0.10-SNAPSHOT.jar is missing.
How can I add that to the path?

Best regards,
Lydia



DataSet transformation

2015-10-01 Thread Lydia Ickler
Hi all,

so I have a case class Spectrum(mz: Float, intensity: Float)
and a DataSet[Spectrum] to read my data in.

Now I want to know if there is a smart way to transform my DataSet into a two 
dimensional Array ?

Thanks in advance,
Lydia



error message

2015-09-30 Thread Lydia Ickler
Hi,
what jar am I missing ?
The error is:
Exception in thread "main" java.lang.NoSuchMethodError: 
org.apache.flink.api.scala.ExecutionEnvironment.readCsvFile$default$4()Z

data flow example on cluster

2015-09-29 Thread Lydia Ickler
Hi all, 

I want to run the data-flow Wordcount example on a Flink Cluster.
The local execution with „mvn exec:exec -Dinput=kinglear.txt 
-Doutput=wordcounts.txt“ is already working.
How is the command to execute it on the cluster?

Best regards,
Lydia 
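
For reference, if the example is packaged as a job jar, the submission pattern used elsewhere in this archive is the CLI client (paths and parallelism are placeholders):

./flink run -p 3 /path/to/WordCount.jar <input> <output>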

Re: HBase issue

2015-09-24 Thread Lydia Ickler
I am really trying to get HBase to work...
Is there maybe a tutorial for all the config files?
Best regards,
Lydia


> Am 23.09.2015 um 17:40 schrieb Maximilian Michels :
> 
> In the issue, it states that it should be sufficient to append the 
> hbase-protocol.jar file to the Hadoop classpath. Flink respects the Hadoop 
> classpath and will append it to its own classpath upon launching a cluster.
> 
> To do that, you need to modify the classpath with one of the commands below. 
> Note that this has to be performed on all cluster nodes.
> 
> export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:/path/to/hbase-protocol.jar"
> export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:$(hbase mapredcp)"
> export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:$(hbase classpath)"
> 
> Alternatively, you can build a fat jar from your project with the missing 
> dependency. Flink will then automatically distribute the jar file upon job 
> submission. Just add this Maven dependency to your fat-jar pom:
> 
> <dependency>
>     <groupId>org.apache.hbase</groupId>
>     <artifactId>hbase-protocol</artifactId>
>     <version>1.1.2</version>
> </dependency>
> 
> Let me know if either of the two approaches works for you. After all, this is a 
> workaround needed because of an HBase optimization.
> 
> Cheers,
> Max
> 
> 
> 
>> On Wed, Sep 23, 2015 at 11:16 AM, Aljoscha Krettek  
>> wrote:
>> It might me that this is causing the problem: 
>> https://issues.apache.org/jira/browse/HBASE-10304
>> 
>> In your log I see the same exception. Anyone has any idea what we could do 
>> about this?
>> 
>> 
>>> On Tue, 22 Sep 2015 at 22:40 Lydia Ickler  wrote:
>>> Hi, 
>>> 
>>> I am trying to get the HBaseReadExample to run. I have filled a table with 
>>> the HBaseWriteExample and purposely split it over 3 regions.
>>> Now when I try to read from it the first split seems to be scanned (170 
>>> rows) fine and after that the Connections of Zookeeper and RCP are suddenly 
>>> closed down.
>>> 
>>> Does anyone has an idea why this is happening?
>>> 
>>> Best regards,
>>> Lydia 
>>> 
>>> 
>>> 22:28:10,178 DEBUG org.apache.flink.runtime.operators.DataSourceTask - 
>>> Opening input split Locatable Split (2) at [grips5:60020]:  DataSource (at 
>>> createInput(ExecutionEnvironment.java:502) 
>>> (org.apache.flink.HBaseReadExample$1)) (1/1)
>>> 22:28:10,178 INFO  org.apache.flink.addons.hbase.TableInputFormat   
>>>  - opening split [2|[grips5:60020]||-]
>>> 22:28:10,189 DEBUG org.apache.zookeeper.ClientCnxn  
>>>  - Reading reply sessionid:0x24ff6a96ecd000a, packet:: clientPath:null 
>>> serverPath:null finished:false header:: 3,4  replyHeader:: 3,51539607639,0  
>>> request:: '/hbase/meta-region-server,F  response:: 
>>> #0001a726567696f6e7365727665723a363030$
>>> 22:28:10,202 DEBUG org.apache.zookeeper.ClientCnxn  
>>>  - Reading reply sessionid:0x24ff6a96ecd000a, packet:: clientPath:null 
>>> serverPath:null finished:false header:: 4,4  replyHeader:: 4,51539607639,0  
>>> request:: '/hbase/meta-region-server,F  response:: 
>>> #0001a726567696f6e7365727665723a363030$
>>> 22:28:10,211 DEBUG LocalActorRefProvider(akka://flink)  
>>>  - resolve of path sequence [/temp/$b] failed
>>> 22:28:10,233 DEBUG org.apache.hadoop.hbase.util.ByteStringer
>>>  - Failed to classload HBaseZeroCopyByteString: 
>>> java.lang.IllegalAccessError: class 
>>> com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass 
>>> com.google.protobuf.LiteralByteString
>>> 22:28:10,358 DEBUG org.apache.hadoop.ipc.RpcClient  
>>>  - Use SIMPLE authentication for service ClientService, sasl=false
>>> 22:28:10,370 DEBUG org.apache.hadoop.ipc.RpcClient  
>>>  - Connecting to grips1/130.73.20.14:60020
>>> 22:28:10,380 DEBUG org.apache.hadoop.ipc.RpcClient  
>>>  - IPC Client (2145423150) connection to grips1/130.73.20.14:60020 from 
>>> hduser: starting, connections 1
>>> 22:28:10,394 DEBUG org.apache.hadoop.ipc.RpcClient  
>>>  - IPC Client (2145423150) connection to grips1/130.73.20.14:60020 from 
>>> hduser: got response header call_id: 0, totalSize: 469 bytes
>>> 22:28:10,397 DEBUG org.apache.hadoop.ipc.RpcClient  
>>>  - IPC Client (2145423150) connection to grips1/130.73.20.14:60020 f

Re: HBase issue

2015-09-24 Thread Lydia Ickler
Hi, I tried that, but unfortunately it still gets stuck at the second split.

Could it be that I have set something wrong in my configuration? In Hadoop? Or in 
Flink?

The strange thing is that the HBaseWriteExample works great!

Best regards,
Lydia


> On 23.09.2015, at 17:40, Maximilian Michels wrote:
> 
> In the issue, it states that it should be sufficient to append the 
> hbase-protocol.jar file to the Hadoop classpath. Flink respects the Hadoop 
> classpath and will append it to its own classpath upon launching a cluster.
> 
> To do that, you need to modify the classpath with one of the commands below. 
> Note that this has to be performed on all cluster nodes.
> 
> export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:/path/to/hbase-protocol.jar"
> export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:$(hbase mapredcp)"
> export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:$(hbase classpath)"
> 
> Alternatively, you can build a fat jar from your project with the missing 
> dependency. Flink will then automatically distribute the jar file upon job 
> submission. Just add this Maven dependency to your fat-jar pom:
> 
> <dependency>
>     <groupId>org.apache.hbase</groupId>
>     <artifactId>hbase-protocol</artifactId>
>     <version>1.1.2</version>
> </dependency>
> 
> Let me know if either of the two approaches works for you. After all, this is a 
> workaround for an HBase optimization.
> 
> Cheers,
> Max
> 
> 
> 
> On Wed, Sep 23, 2015 at 11:16 AM, Aljoscha Krettek  wrote:
> It might be that this is causing the problem: 
> https://issues.apache.org/jira/browse/HBASE-10304
> 
> In your log I see the same exception. Does anyone have an idea what we could do 
> about this?
> 
> 
> On Tue, 22 Sep 2015 at 22:40 Lydia Ickler  wrote:
> Hi, 
> 
> I am trying to get the HBaseReadExample to run. I have filled a table with 
> the HBaseWriteExample and purposely split it over 3 regions.
> Now when I try to read from it, the first split seems to be scanned fine (170 rows), 
> and after that the ZooKeeper and RPC connections are suddenly closed down.
> 
> Does anyone have an idea why this is happening?
> 
> Best regards,
> Lydia 
> 
> 
> 22:28:10,178 DEBUG org.apache.flink.runtime.operators.DataSourceTask  
>- Opening input split Locatable Split (2) at [grips5:60020]:  DataSource 
> (at createInput(ExecutionEnvironment.java:502) 
> (org.apache.flink.HBaseReadExample$1)) (1/1)
> 22:28:10,178 INFO  org.apache.flink.addons.hbase.TableInputFormat 
>- opening split [2|[grips5:60020]||-]
> 22:28:10,189 DEBUG org.apache.zookeeper.ClientCnxn
>- Reading reply sessionid:0x24ff6a96ecd000a, packet:: clientPath:null 
> serverPath:null finished:false header:: 3,4  replyHeader:: 3,51539607639,0  
> request:: '/hbase/meta-region-server,F  response:: 
> #0001a726567696f6e7365727665723a363030$
> 22:28:10,202 DEBUG org.apache.zookeeper.ClientCnxn
>- Reading reply sessionid:0x24ff6a96ecd000a, packet:: clientPath:null 
> serverPath:null finished:false header:: 4,4  replyHeader:: 4,51539607639,0  
> request:: '/hbase/meta-region-server,F  response:: 
> #0001a726567696f6e7365727665723a363030$
> 22:28:10,211 DEBUG LocalActorRefProvider(akka://flink) 
>   - resolve of path sequence [/temp/$b] failed
> 22:28:10,233 DEBUG org.apache.hadoop.hbase.util.ByteStringer  
>- Failed to classload HBaseZeroCopyByteString: 
> java.lang.IllegalAccessError: class 
> com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass 
> com.google.protobuf.LiteralByteString
> 22:28:10,358 DEBUG org.apache.hadoop.ipc.RpcClient
>- Use SIMPLE authentication for service ClientService, sasl=false
> 22:28:10,370 DEBUG org.apache.hadoop.ipc.RpcClient
>- Connecting to grips1/130.73.20.14:60020
> 22:28:10,380 DEBUG org.apache.hadoop.ipc.RpcClient
>- IPC Client (2145423150) connection to grips1/130.73.20.14:60020 from hduser: 
> starting, connections 1
> 22:28:10,394 DEBUG org.apache.hadoop.ipc.RpcClient
>- IPC Client (2145423150) connection to grips1/130.73.20.14:60020 from hduser: 
> got response header call_id: 0, totalSize: 469 bytes
> 22:28:10,397 DEBUG org.apache.hadoop.ipc.RpcClient
>- IPC Client (2145423150) connection to grips1/130.73.20.14:60020

Re: no valid hadoop home directory can be found

2015-09-23 Thread Lydia Ickler
I added -Dhadoop.home.dir= to flink-conf.yaml but still get the same error.
Could it be that it somehow has to be added to HBase directly?


> On 23.09.2015, at 13:27, Ufuk Celebi wrote:
> 
> Either specify the env variable HADOOP_HOME or set the JVM property 
> ‘hadoop.home.dir’ (-Dhadoop.home.dir=…)
> 
> – Ufuk
> 
>> On 23 Sep 2015, at 12:43, Lydia Ickler  wrote:
>> 
>> Hi all,
>> 
>> I get the following error message that no valid hadoop home directory can be 
>> found when trying to initialize the HBase configuration.
>> Where would I specify that path?
>> 
>> 12:41:02,043 INFO  org.apache.flink.addons.hbase.TableInputFormat
>> - Initializing HBaseConfiguration
>> 12:41:02,174 DEBUG org.apache.hadoop.util.Shell  
>> - Failed to detect a valid hadoop home directory
>> java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
>>at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:225)
>>at org.apache.hadoop.util.Shell.(Shell.java:250)
>>at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>>at 
>> org.apache.hadoop.conf.Configuration.getStrings(Configuration.java:1514)
>>at 
>> org.apache.hadoop.hbase.zookeeper.ZKConfig.makeZKProps(ZKConfig.java:113)
>>at 
>> org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:259)
>>at 
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.(ZooKeeperWatcher.java:159)
>>at 
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.(ZooKeeperWatcher.java:134)
>>at 
>> org.apache.hadoop.hbase.client.ZooKeeperKeepAliveConnection.(ZooKeeperKeepAliveConnection.java:43)
>>at 
>> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveZooKeeperWatcher(HConnectionManager.java:1816)
>>at 
>> org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:82)
>>at 
>> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:907)
>>at 
>> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.(HConnectionManager.java:701)
>>at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
>> Method)
>>at 
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>at 
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
>>at 
>> org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:457)
>>at 
>> org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:436)
>>at 
>> org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:317)
>>at org.apache.hadoop.hbase.client.HTable.(HTable.java:198)
>>at org.apache.hadoop.hbase.client.HTable.(HTable.java:160)
>>at 
>> org.apache.flink.addons.hbase.TableInputFormat.createTable(TableInputFormat.java:89)
>>at 
>> org.apache.flink.addons.hbase.TableInputFormat.configure(TableInputFormat.java:78)
>>at 
>> org.apache.flink.runtime.operators.DataSourceTask.initInputFormat(DataSourceTask.java:273)
>>at 
>> org.apache.flink.runtime.operators.DataSourceTask.registerInputOutput(DataSourceTask.java:79)
>>at org.apache.flink.runtime.taskmanager.Task.run(Task.java:501)
>>at java.lang.Thread.run(Thread.java:745)
>> 
> 
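
One possible reason the flink-conf.yaml attempt had no effect is that the JVM property 
has to be passed through env.java.opts (if the Flink version in use supports that key) 
rather than being a config entry of its own. A rough sketch of the two options Ufuk 
mentions, with example paths, to be applied on every node before restarting Flink:

# option 1: environment variable, set for the user that starts Flink
export HADOOP_HOME=/usr/local/hadoop

# option 2: JVM property, passed via conf/flink-conf.yaml
env.java.opts: -Dhadoop.home.dir=/usr/local/hadoop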



no valid hadoop home directory can be found

2015-09-23 Thread Lydia Ickler
Hi all,

I get the following error message that no valid hadoop home directory can be 
found when trying to initialize the HBase configuration.
Where would I specify that path?

12:41:02,043 INFO  org.apache.flink.addons.hbase.TableInputFormat   
 - Initializing HBaseConfiguration
12:41:02,174 DEBUG org.apache.hadoop.util.Shell 
 - Failed to detect a valid hadoop home directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:225)
at org.apache.hadoop.util.Shell.(Shell.java:250)
at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
at 
org.apache.hadoop.conf.Configuration.getStrings(Configuration.java:1514)
at 
org.apache.hadoop.hbase.zookeeper.ZKConfig.makeZKProps(ZKConfig.java:113)
at 
org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:259)
at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.(ZooKeeperWatcher.java:159)
at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.(ZooKeeperWatcher.java:134)
at 
org.apache.hadoop.hbase.client.ZooKeeperKeepAliveConnection.(ZooKeeperKeepAliveConnection.java:43)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveZooKeeperWatcher(HConnectionManager.java:1816)
at 
org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:82)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:907)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.(HConnectionManager.java:701)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at 
org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:457)
at 
org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:436)
at 
org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:317)
at org.apache.hadoop.hbase.client.HTable.(HTable.java:198)
at org.apache.hadoop.hbase.client.HTable.(HTable.java:160)
at 
org.apache.flink.addons.hbase.TableInputFormat.createTable(TableInputFormat.java:89)
at 
org.apache.flink.addons.hbase.TableInputFormat.configure(TableInputFormat.java:78)
at 
org.apache.flink.runtime.operators.DataSourceTask.initInputFormat(DataSourceTask.java:273)
at 
org.apache.flink.runtime.operators.DataSourceTask.registerInputOutput(DataSourceTask.java:79)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:501)
at java.lang.Thread.run(Thread.java:745)



HBase issue

2015-09-22 Thread Lydia Ickler
Hi, 

I am trying to get the HBaseReadExample to run. I have filled a table with the 
HBaseWriteExample and purposely split it over 3 regions.
Now when I try to read from it, the first split seems to be scanned fine (170 rows), 
and after that the ZooKeeper and RPC connections are suddenly closed down.

Does anyone have an idea why this is happening?

Best regards,
Lydia 


22:28:10,178 DEBUG org.apache.flink.runtime.operators.DataSourceTask
 - Opening input split Locatable Split (2) at [grips5:60020]:  DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.HBaseReadExample$1)) (1/1)
22:28:10,178 INFO  org.apache.flink.addons.hbase.TableInputFormat   
 - opening split [2|[grips5:60020]||-]
22:28:10,189 DEBUG org.apache.zookeeper.ClientCnxn  
 - Reading reply sessionid:0x24ff6a96ecd000a, packet:: clientPath:null 
serverPath:null finished:false header:: 3,4  replyHeader:: 3,51539607639,0  
request:: '/hbase/meta-region-server,F  response:: 
#0001a726567696f6e7365727665723a363030$
22:28:10,202 DEBUG org.apache.zookeeper.ClientCnxn  
 - Reading reply sessionid:0x24ff6a96ecd000a, packet:: clientPath:null 
serverPath:null finished:false header:: 4,4  replyHeader:: 4,51539607639,0  
request:: '/hbase/meta-region-server,F  response:: 
#0001a726567696f6e7365727665723a363030$
22:28:10,211 DEBUG LocalActorRefProvider(akka://flink)  
 - resolve of path sequence [/temp/$b] failed
22:28:10,233 DEBUG org.apache.hadoop.hbase.util.ByteStringer
 - Failed to classload HBaseZeroCopyByteString: java.lang.IllegalAccessError: 
class com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass 
com.google.protobuf.LiteralByteString
22:28:10,358 DEBUG org.apache.hadoop.ipc.RpcClient  
 - Use SIMPLE authentication for service ClientService, sasl=false
22:28:10,370 DEBUG org.apache.hadoop.ipc.RpcClient  
 - Connecting to grips1/130.73.20.14:60020
22:28:10,380 DEBUG org.apache.hadoop.ipc.RpcClient  
 - IPC Client (2145423150) connection to grips1/130.73.20.14:60020 from hduser: 
starting, connections 1
22:28:10,394 DEBUG org.apache.hadoop.ipc.RpcClient  
 - IPC Client (2145423150) connection to grips1/130.73.20.14:60020 from hduser: 
got response header call_id: 0, totalSize: 469 bytes
22:28:10,397 DEBUG org.apache.hadoop.ipc.RpcClient  
 - IPC Client (2145423150) connection to grips1/130.73.20.14:60020 from hduser: 
wrote request header call_id: 0 method_name: "Get" request_param: true
22:28:10,413 DEBUG org.apache.zookeeper.ClientCnxn  
 - Reading reply sessionid:0x24ff6a96ecd000a, packet:: clientPath:null 
serverPath:null finished:false header:: 5,4  replyHeader:: 5,51539607639,0  
request:: '/hbase/meta-region-server,F  response:: 
#0001a726567696f6e7365727665723a363030$
22:28:10,424 DEBUG org.apache.hadoop.ipc.RpcClient  
 - IPC Client (2145423150) connection to grips1/130.73.20.14:60020 from hduser: 
wrote request header call_id: 1 method_name: "Scan" request_param: true 
priority: 100
22:28:10,426 DEBUG org.apache.hadoop.ipc.RpcClient  
 - IPC Client (2145423150) connection to grips1/130.73.20.14:60020 from hduser: 
got response header call_id: 1 cell_block_meta { length: 480 }, totalSize: 497 
bytes
22:28:10,432 DEBUG org.apache.hadoop.hbase.client.ClientSmallScanner
 - Finished with small scan at {ENCODED => 1588230740, NAME => 'hbase:meta,,1', 
STARTKEY => '', ENDKEY => ''}
22:28:10,434 DEBUG org.apache.hadoop.ipc.RpcClient  
 - Use SIMPLE authentication for service ClientService, sasl=false
22:28:10,434 DEBUG org.apache.hadoop.ipc.RpcClient  
 - Connecting to grips5/130.73.20.16:60020
22:28:10,435 DEBUG org.apache.hadoop.ipc.RpcClient  
 - IPC Client (2145423150) connection to grips5/130.73.20.16:60020 from hduser: 
wrote request header call_id: 2 method_name: "Scan" request_param: true
22:28:10,436 DEBUG org.apache.hadoop.ipc.RpcClient  
 - IPC Client (2145423150) connection to grips5/130.73.20.16:60020 from hduser: 
starting, connections 2
22:28:10,437 DEBUG org.apache.hadoop.ipc.RpcClient  
 - IPC Client (2145423150) connection to grips5/130.73.20.16:60020 from hduser: 
got response header call_id: 2, totalSize: 12 bytes
22:28:10,438 DEBUG org.apache.flink.runtime.operators.DataSourceTask
 - Starting to read input from split Locatable Split (2) at [grips5:60020]:  
DataSource (at createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.HBaseReadExample$1)) (1/1)
22:28:10,438 DEBUG org.apache.hadoop.ipc.RpcClient  
 - I

Re: Job stuck at "Assigning split to host..."

2015-07-27 Thread Lydia Ickler
Hi Ufuk, 

yes, I figured out that the HMaster of HBase did not start properly!
Now everything is working :)

Thanks for your help!

Best regards,
Lydia


> On 27.07.2015, at 11:45, Ufuk Celebi wrote:
> 
> Any update on this Lydia?
> 
> On 23 Jul 2015, at 16:38, Ufuk Celebi  wrote:
> 
>> Unfortunately we don't gain more insight from the DEBUG logs... it looks 
>> like the connection is just lost. Did you have a look at the *.out files as 
>> well? Maybe something was printed to sysout.
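
Since the root cause here turned out to be an HMaster that had not started (see the 
follow-up above), a quick sanity check of the HBase daemons before submitting the job 
can save time; a sketch, to be run on the HBase master node:

# the JVM process list should show an HMaster process
jps

# 'status' in the HBase shell should report the expected number of region servers
echo "status" | hbase shell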



Re: Job stuck at "Assigning split to host..."

2015-07-23 Thread Lydia Ickler
Hi Ufuk,

no, I don’t mind!
Where would I change the log level?

Best regards,
Lydia

> On 23.07.2015, at 14:41, Ufuk Celebi wrote:
> 
> Hey Lydia,
> 
> it looks like the HBase client is losing its connection to HBase. Before 
> that, everything seems to be working just fine (X rows are read etc.).
> 
> Do you mind setting the log level to DEBUG and then posting the logs again?
> 
> – Ufuk
> 
> On 23 Jul 2015, at 14:12, Lydia Ickler  wrote:
> 
>> Hi,
>> 
>> I am trying to read data from an HBase table via HBaseReadExample.java.
>> Unfortunately, my run always gets stuck at the same position.
>> Do you guys have any suggestions?
>> 
>> In the master node it says:
>> 14:05:04,239 INFO  org.apache.flink.runtime.jobmanager.JobManager
>> - Received job bb9560efb8117ce7e840bea2c4b967c1 (Flink Java Job at Thu 
>> Jul 23 14:04:57 CEST 2015).
>> 14:05:04,268 INFO  org.apache.flink.addons.hbase.TableInputFormat
>> - Initializing HBaseConfiguration
>> 14:05:04,346 INFO  org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper
>> - Process identifier=hconnection-0x6704d3b4 connecting to ZooKeeper 
>> ensemble=localhost:2181
>> 14:05:04,347 INFO  org.apache.zookeeper.ZooKeeper
>> - Initiating client connection, connectString=localhost:2181 
>> sessionTimeout=9 watcher=hconnection-0x6704d3b40x0, 
>> quorum=localhost:2181, baseZNode=/hbase
>> 14:05:04,352 INFO  org.apache.zookeeper.ClientCnxn   
>> - Opening socket connection to server 
>> 127.0.0.1/127.0.0.1:2181
>> . Will not attempt to authenticate using SASL (unknown error)
>> 14:05:04,353 INFO  org.apache.zookeeper.ClientCnxn   
>> - Socket connection established to 
>> 127.0.0.1/127.0.0.1:2181
>> , initiating session
>> 14:05:04,376 INFO  org.apache.zookeeper.ClientCnxn   
>> - Session establishment complete on server 
>> 127.0.0.1/127.0.0.1:2181
>> , sessionid = 0x24ebaaf7d06000a, negotiated timeout = 4
>> 14:05:04,637 INFO  org.apache.flink.addons.hbase.TableInputFormat
>> - Created 4 splits
>> 14:05:04,637 INFO  org.apache.flink.addons.hbase.TableInputFormat
>> - created split [0|[grips5:16020]|-|LUAD+5781]
>> 14:05:04,637 INFO  org.apache.flink.addons.hbase.TableInputFormat
>> - created split [1|[grips1:16020]|LUAD+5781|LUAD+7539]
>> 14:05:04,637 INFO  org.apache.flink.addons.hbase.TableInputFormat
>> - created split [2|[grips1:16020]|LUAD+7539|LUAD+8552]
>> 14:05:04,637 INFO  org.apache.flink.addons.hbase.TableInputFormat
>> - created split [3|[grips1:16020]|LUAD+8552|-]
>> 14:05:04,641 INFO  org.apache.flink.runtime.jobmanager.JobManager
>> - Scheduling job Flink Java Job at Thu Jul 23 14:04:57 CEST 2015.
>> 14:05:04,642 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>> - CHAIN DataSource (at createInput(ExecutionEnvironment.java:502) 
>> (org.apache.flink.addons.hbase.HBaseReadExample$1)) -> FlatMap (collect()) 
>> (1/1) (94de8700cefe1651558e25c98829a156) switched from CREATED to SCHEDULED
>> 14:05:04,643 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>> - CHAIN DataSource (at createInput(ExecutionEnvironment.java:502) 
>> (org.apache.flink.addons.hbase.HBaseReadExample$1)) -> FlatMap (collect()) 
>> (1/1) (94de8700cefe1651558e25c98829a156) switched from SCHEDULED to DEPLOYING
>> 14:05:04,643 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>> - Deploying CHAIN DataSource (at 
>> createInput(ExecutionEnvironment.java:502) 
>> (org.apache.flink.addons.hbase.HBaseReadExample$1)) -> FlatMap (collect()) 
>> (1/1) (attempt #0) to grips4
>> 14:05:04,647 INFO  org.apache.flink.runtime.jobmanager.JobManager
>> - Status of job bb9560efb8117ce7e840bea2c4b967c1 (Flink Java Job at Thu 
>> Jul 23 14:04:57 CEST 2015) changed to RUNNING.
>> 14:05:07,537 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>> - CHAIN DataSource (at createInput(ExecutionEnvironment.java:502) 
>> (org.apache.flink.addons.hbase.HBaseReadExample$1)) -> FlatMap (collect()) 
>> (1/1) (94de8700cefe1651558e25c98829a156) switched from DEPLOYING to RUNNING
>> 14:05:07,545 INFO  
>> org.apache.flink.api.common.io.LocatableInputSplitAssigner- Assigning 
>> remote split to host grips4
>> 14:05:08,338 INFO  
>> org.apache.flink.api.common.io.LocatableInputSplitAssig

Job stuck at "Assigning split to host..."

2015-07-23 Thread Lydia Ickler
Hi,

I am trying to read data from an HBase table via HBaseReadExample.java.
Unfortunately, my run always gets stuck at the same position.
Do you guys have any suggestions?

In the master node it says:

14:05:04,239 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Received job bb9560efb8117ce7e840bea2c4b967c1 (Flink Java
Job at Thu Jul 23 14:04:57 CEST 2015).
14:05:04,268 INFO  org.apache.flink.addons.hbase.TableInputFormat
  - Initializing HBaseConfiguration
14:05:04,346 INFO
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper-
Process identifier=hconnection-0x6704d3b4 connecting to ZooKeeper
ensemble=localhost:2181
14:05:04,347 INFO  org.apache.zookeeper.ZooKeeper
  - Initiating client connection, connectString=localhost:2181
sessionTimeout=9 watcher=hconnection-0x6704d3b40x0,
quorum=localhost:2181, baseZNode=/hbase
14:05:04,352 INFO  org.apache.zookeeper.ClientCnxn
  - Opening socket connection to server
127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL
(unknown error)
14:05:04,353 INFO  org.apache.zookeeper.ClientCnxn
  - Socket connection established to 127.0.0.1/127.0.0.1:2181,
initiating session
14:05:04,376 INFO  org.apache.zookeeper.ClientCnxn
  - Session establishment complete on server
127.0.0.1/127.0.0.1:2181, sessionid = 0x24ebaaf7d06000a, negotiated
timeout = 4
14:05:04,637 INFO  org.apache.flink.addons.hbase.TableInputFormat
  - Created 4 splits
14:05:04,637 INFO  org.apache.flink.addons.hbase.TableInputFormat
  - created split [0|[grips5:16020]|-|LUAD+5781]
14:05:04,637 INFO  org.apache.flink.addons.hbase.TableInputFormat
  - created split [1|[grips1:16020]|LUAD+5781|LUAD+7539]
14:05:04,637 INFO  org.apache.flink.addons.hbase.TableInputFormat
  - created split [2|[grips1:16020]|LUAD+7539|LUAD+8552]
14:05:04,637 INFO  org.apache.flink.addons.hbase.TableInputFormat
  - created split [3|[grips1:16020]|LUAD+8552|-]
14:05:04,641 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Scheduling job Flink Java Job at Thu Jul 23 14:04:57 CEST
2015.
14:05:04,642 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph- CHAIN
DataSource (at createInput(ExecutionEnvironment.java:502)
(org.apache.flink.addons.hbase.HBaseReadExample$1)) -> FlatMap
(collect()) (1/1) (94de8700cefe1651558e25c98829a156) switched from
CREATED to SCHEDULED
14:05:04,643 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph- CHAIN
DataSource (at createInput(ExecutionEnvironment.java:502)
(org.apache.flink.addons.hbase.HBaseReadExample$1)) -> FlatMap
(collect()) (1/1) (94de8700cefe1651558e25c98829a156) switched from
SCHEDULED to DEPLOYING
14:05:04,643 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph-
Deploying CHAIN DataSource (at
createInput(ExecutionEnvironment.java:502)
(org.apache.flink.addons.hbase.HBaseReadExample$1)) -> FlatMap
(collect()) (1/1) (attempt #0) to grips4
14:05:04,647 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Status of job bb9560efb8117ce7e840bea2c4b967c1 (Flink Java
Job at Thu Jul 23 14:04:57 CEST 2015) changed to RUNNING.
14:05:07,537 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph- CHAIN
DataSource (at createInput(ExecutionEnvironment.java:502)
(org.apache.flink.addons.hbase.HBaseReadExample$1)) -> FlatMap
(collect()) (1/1) (94de8700cefe1651558e25c98829a156) switched from
DEPLOYING to RUNNING
14:05:07,545 INFO
org.apache.flink.api.common.io.LocatableInputSplitAssigner-
Assigning remote split to host grips4
14:05:08,338 INFO
org.apache.flink.api.common.io.LocatableInputSplitAssigner-
Assigning remote split to host grips4


And in node "grips4":

07,273 INFO  org.apache.zookeeper.ZooKeeper
- Initiating client connection, connectString=localhost:2181
sessionTi$
14:05:07,296 INFO  org.apache.zookeeper.ClientCnxn
  - Opening socket connection to server
127.0.0.1/127.0.0.1:2181. Will n$
14:05:07,300 INFO  org.apache.zookeeper.ClientCnxn
  - Socket connection established to 127.0.0.1/127.0.0.1:2181,
initiatin$
14:05:07,332 INFO  org.apache.zookeeper.ClientCnxn
  - Session establishment complete on server
127.0.0.1/127.0.0.1:2181, s$
14:05:07,531 INFO  org.apache.flink.runtime.taskmanager.Task
  - CHAIN DataSource (at
createInput(ExecutionEnvironment.java:502) (org$
14:05:07,970 INFO  org.apache.flink.addons.hbase.TableInputFormat
  - opening split [3|[grips1:16020]|LUAD+8552|-]
14:05:08,223 INFO
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation
 - Closing zookeeper sessionid=0x44ebaaf7d35000e
14:05:08,235 INFO  org.apache.zookeeper.ZooKeeper
  - Session: 0x44ebaaf7d35000e closed
14:05:08,235 INFO  org.apache.zookeeper.ClientCnxn
  - EventThread shut down
14:05:08,337 INFO  org.apache.flink.addons.hbase.TableInputFormat
  - Closing split (scanned 129 rows)
14:05:08,343 INFO  org.apach

Re: HBase on 4 machine cluster - OutOfMemoryError

2015-07-18 Thread Lydia Ickler
Hi,

yes, it is in one row. Each row represents a patient that has the values of 20,000 
different genes stored in one column family and one health-status value in a 
second column family.


> On 18.07.2015, at 15:38, Stephan Ewen wrote:
> 
> This error is in the HBase RPC Service. Apparently the RPC message is very 
> large.
> 
> Is the data that you request in one row?
> 
> On 18.07.2015, at 00:50, "Lydia Ickler" wrote:
> Hi all,
> 
> I am trying to read a data set from HBase within a cluster application. 
> The data is about 90 MB in size.
> 
> When I run the program on a cluster consisting of 4 machines (8GB RAM) I get 
> the following error on the head-node:
> 
> 16:57:41,572 INFO  org.apache.flink.api.common.io.LocatableInputSplitAssigner 
>- Assigning remote split to host grips5
> 17:17:26,127 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph 
>- DataSource (at createInput(ExecutionEnvironment.java:502) 
> (org.apache.flink.addons.hbase.HBaseR$
> 17:17:26,128 INFO  org.apache.flink.runtime.jobmanager.JobManager 
>- Status of job b768ff76167fa3ea3e4cb3cc3481ba80 (Labeled - ML) changed to 
> FAILING.
> 
> And within the machine grips5:
> 16:57:23,769 INFO  org.apache.flink.addons.hbase.TableInputFormat 
>- opening split [1|[grips1:16020]|LUAD+5781|LUAD+7539]
> 16:57:33,734 WARN  org.apache.hadoop.ipc.RpcClient
>- IPC Client (767445418) connection to grips1/130.73.20.14:16020 from hduser: unexpected exceptio$
> java.lang.OutOfMemoryError: Java heap space
> at 
> org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1117)
> at 
> org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:727)
> 16:57:39,969 WARN  org.apache.hadoop.ipc.RpcClient
>- IPC Client (767445418) connection to grips1/130.73.20.14:16020 from hduser: unexpected exceptio$
> java.lang.OutOfMemoryError: Java heap space
> 
> and then it just closes the ZooKeeper connection…
> 
> Do you have a suggestion how to avoid this OutOfMemoryError?
> Best regards,
> Lydia
> 
> 
> 
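
Since each row here carries roughly 20,000 columns, one knob worth trying is the Scan 
used by the TableInputFormat: limiting caching and batching keeps the individual RPC 
responses small. A sketch of such a Scan; the column family name and the concrete 
numbers are assumptions to be tuned:

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.util.Bytes

// Sketch of a Scan to return from the input format's getScanner():
// cap how much data a single RPC response can carry.
val scan = new Scan()
scan.addFamily(Bytes.toBytes("genes"))  // assumed column family name
scan.setCaching(10)                     // rows fetched per RPC round trip
scan.setBatch(1000)                     // max columns returned per Result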



HBase on 4 machine cluster - OutOfMemoryError

2015-07-17 Thread Lydia Ickler
Hi all,

I am trying to read a data set from HBase within a cluster application. 
The data is about 90 MB in size.

When I run the program on a cluster consisting of 4 machines (8GB RAM) I get 
the following error on the head-node:

16:57:41,572 INFO  org.apache.flink.api.common.io.LocatableInputSplitAssigner   
 - Assigning remote split to host grips5
17:17:26,127 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph   
 - DataSource (at createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.addons.hbase.HBaseR$
17:17:26,128 INFO  org.apache.flink.runtime.jobmanager.JobManager   
 - Status of job b768ff76167fa3ea3e4cb3cc3481ba80 (Labeled - ML) changed to 
FAILING.

And within the machine grips5:
16:57:23,769 INFO  org.apache.flink.addons.hbase.TableInputFormat   
 - opening split [1|[grips1:16020]|LUAD+5781|LUAD+7539]
16:57:33,734 WARN  org.apache.hadoop.ipc.RpcClient  
 - IPC Client (767445418) connection to grips1/130.73.20.14:16020 from hduser: 
unexpected exceptio$
java.lang.OutOfMemoryError: Java heap space
at 
org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1117)
at 
org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:727)
16:57:39,969 WARN  org.apache.hadoop.ipc.RpcClient  
 - IPC Client (767445418) connection to grips1/130.73.20.14:16020 from hduser: 
unexpected exceptio$
java.lang.OutOfMemoryError: Java heap space

and then it just closes the ZooKeeper connection…

Do you have a suggestion how to avoid this OutOfMemoryError?
Best regards,
Lydia





DataSet Conversion

2015-07-13 Thread Lydia Ickler
Hi guys,

Is it possible to convert a Java DataSet to a Scala DataSet?
Right now I get the following error:
Error:(102, 29) java: incompatible types: 
'org.apache.flink.api.java.DataSet cannot be converted to 
org.apache.flink.api.scala.DataSet‘

Thanks in advance,
Lydia
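
One workaround that has been suggested for this is to wrap the Java DataSet in the 
Scala one directly; a sketch, assuming the Scala DataSet constructor is accessible in 
the Flink version in use:

import org.apache.flink.api.java.{DataSet => JavaDataSet}
import org.apache.flink.api.scala.DataSet

// Sketch: wrap a Java-API DataSet into a Scala-API DataSet.
// The context bound supplies the ClassTag the Scala DataSet expects.
def asScalaDataSet[T: scala.reflect.ClassTag](javaSet: JavaDataSet[T]): DataSet[T] =
  new DataSet[T](javaSet)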

HBase & Machine Learning

2015-07-11 Thread Lydia Ickler
Dear Sir or Madame,

I would like to use the Flink-HBase addon to read data that then serves as 
input for the machine learning algorithms, namely the SVM and MLR. 
Right now I first write the extracted data to a temporary file and then read it 
back in via the libSVM method... but I guess there should be a more sophisticated way.

Do you have a code snippet or an idea of how to do so?

Many thanks in advance and best regards,
Lydia
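
A more direct route than the temporary libSVM file is to map whatever the HBase input 
format produces straight into FlinkML's LabeledVector and feed that DataSet to the 
learner's fit() call. A sketch, assuming the input format already yields a 
(label, features) pair per patient; the names and types are assumptions:

import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector

// Sketch: rows is whatever the HBase input format produces, here assumed to be
// (healthStatus, geneExpressionValues) per patient.
def toTrainingSet(rows: DataSet[(Double, Array[Double])]): DataSet[LabeledVector] =
  rows.map { case (label, features) => LabeledVector(label, DenseVector(features)) }

The resulting DataSet[LabeledVector] can then be passed to SVM or 
MultipleLinearRegression via fit(), without the detour through a file.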