Re: Any remote opportunity to work on Flink project?

2018-06-05 Thread Christophe Salperwyck
Some people are still interested in paying people to build a product around
Flink :-)

I'm interested in Flink and online ML too!

Cheers,
Christophe

On Wed, 6 Jun 2018 at 07:40, Garvit Sharma  wrote:

> Flink is open source!!
>
> On Wed, Jun 6, 2018 at 10:45 AM, Deepak Sharma 
> wrote:
>
>> Hi Flink Users,
>> Sorry to spam your inbox, and good morning all.
>> I am looking for an opportunity to work on a Flink project, specifically
>> if it is Flink ML over streaming.
>> Please do let me know if anyone is looking for freelancers around any of
>> their Flink projects.
>>
>> --
>> Thanks
>> Deepak
>>
>
>
>
> --
>
> Garvit Sharma
> github.com/garvitlnmiit/
>
> Nobody is a scholar by birth; it's only hard work and strong determination
> that make him a master.
>


Re: FlinkML

2018-04-18 Thread Christophe Salperwyck
Hi,

You could try to plug in the MOA/Weka library too. I did some preliminary
work on that:
https://moa.cms.waikato.ac.nz/moa-with-apache-flink/

but then you are not using FlinkML algorithms anymore.
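
The glue code is essentially a map operator wrapping a MOA classifier. A
minimal test-then-train sketch (assuming the stream already carries MOA
Instance objects; HoeffdingTree is just one possible learner):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

import com.yahoo.labs.samoa.instances.Instance;

import moa.classifiers.trees.HoeffdingTree;

// Prequential (test-then-train): first predict, then train on the element.
public class HoeffdingTreeMap extends RichMapFunction<Instance, Tuple2<Double, Double>> {

    private transient HoeffdingTree classifier;

    @Override
    public void open(final Configuration parameters) {
        classifier = new HoeffdingTree();
        classifier.prepareForUse();
    }

    @Override
    public Tuple2<Double, Double> map(final Instance instance) {
        // predict first, then train, and emit (prediction, true label)
        final double predicted = maxIndex(classifier.getVotesForInstance(instance));
        classifier.trainOnInstance(instance);
        return new Tuple2<>(predicted, instance.classValue());
    }

    // index of the highest vote, i.e. the predicted class
    private static double maxIndex(final double[] votes) {
        int best = 0;
        for (int i = 1; i < votes.length; i++) {
            if (votes[i] > votes[best]) {
                best = i;
            }
        }
        return best;
    }
}

Note that each parallel subtask trains its own independent model, so you
would run this with parallelism 1, or partition by key and train one model
per key.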

Best regards,
Christophe


2018-04-18 21:13 GMT+02:00 shashank734 :

> There are no active discussions or guides on that. But I found this example
> in the repo:
>
> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/ml/IncrementalLearningSkeleton.java
>
> It tries to do the same thing, although I haven't checked it yet.
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>


Re: Flink and Docker ?

2018-04-03 Thread Christophe Salperwyck
Hi,

I haven't tried Docker with Flink, but I know these guys did:
https://github.com/big-data-europe/docker-flink

Perhaps it is worth having a look there.
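
For a quick local try-out, something along these lines should work with the
official image (a rough sketch; the exact image tags and options may differ,
so check the image documentation):

docker run -d --name jobmanager -p 8081:8081 \
    -e JOB_MANAGER_RPC_ADDRESS=jobmanager flink jobmanager
docker run -d --name taskmanager --link jobmanager \
    -e JOB_MANAGER_RPC_ADDRESS=jobmanager flink taskmanager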

BR,
Christophe

2018-04-03 9:29 GMT+02:00 Esa Heikkinen :

> Hi
>
>
>
> I have noticed that installing Flink and building the first applications
> from scratch can be pretty tedious, especially if the application is a
> little bit complex. There are also slightly different development and
> runtime environments, which require different software components with the
> correct versions.
>
>
>
> I found Docker could help with this problem:
>
> https://flink.apache.org/news/2017/05/16/official-docker-image.html
>
>
>
> Has anyone used Flink with Docker, and what are your experiences with it?
>
>
>
> Do you recommend using Flink with Docker?
>
>
>
> Can there be problems with different versions, if some software component
> in the Docker image is not correct or not the latest?
>
>
>
> Best, Esa
>
>
>


Re: Machine Learning: Flink and MOA

2018-02-28 Thread Christophe Salperwyck
Hello Theodore,

Glad to hear that there is an interest in plugging MOA with Flink!

Which part/type of MOA classifiers would you want to plug into Flink?
Let me know if you want to discuss this in more detail.

I guess some of MOA's windowing functions would be better implemented as
Flink windows (performance evaluation first, I would say). We would need to
speak with Albert to see how this could be handled (change some MOA code?).
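
As a rough sketch of what the evaluation side could look like in Flink,
assuming a stream of (prediction, label) pairs coming out of the MOA
wrapper (the window size is arbitrary):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.AllWindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

DataStream<Double> accuracy = predictionsAndLabels
    .timeWindowAll(Time.minutes(10))
    .apply(new AllWindowFunction<Tuple2<Double, Double>, Double, TimeWindow>() {

        @Override
        public void apply(final TimeWindow window,
                final Iterable<Tuple2<Double, Double>> values,
                final Collector<Double> out) throws Exception {
            long correct = 0;
            long total = 0;
            for (final Tuple2<Double, Double> v : values) {
                if (v.f0.equals(v.f1)) {
                    correct++;
                }
                total++;
            }
            // accuracy of the classifier over this window
            out.collect(total == 0 ? 0.0 : (double) correct / total);
        }
    });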

Regards,
Christophe

2018-02-24 1:03 GMT+01:00 Theodore Vasiloudis <
theodoros.vasilou...@gmail.com>:

> Hello Christophe,
>
> That's very interesting. I've been working with MOA/SAMOA recently and was
> considering whether we could create some easy integration with Flink.
>
> I have a Master's student this year who could do some work on this;
> hopefully we can create something interesting there.
>
> Regards,
> Theodore
>
> On Wed, Feb 21, 2018 at 7:38 PM, Christophe Salperwyck <
> christophe.salperw...@gmail.com> wrote:
>
>> Hi guys,
>>
>> I know there is FlinkML to do some machine learning with Flink, but it
>> works on DataSet and not on DataStream. There is also SAMOA, which can run
>> on Flink, but I find it a bit too complicated.
>>
>> I wanted to see if it would be easy to plug MOA directly into Flink and
>> tried to present it at the DataKRK meetup, but I didn't have time at the
>> end of the presentation... Nevertheless, I spent a bit of time plugging
>> Flink and MOA together, and I thought it might be worth sharing in case it
>> is interesting for someone. I also take this opportunity to get some
>> feedback on it from people in the Flink community, if they have a bit of
>> time to review it.
>>
>> Here is the code:
>> https://github.com/csalperwyck/moa-flink-ozabag-example
>> https://github.com/csalperwyck/moa-flink-traintest-example
>>
>> Many Flink methods were very convenient for plugging these two tools
>> together :-)
>>
>> Keep up the good work!
>>
>> Cheers,
>> Christophe
>> PS: if some people are at bigdatatechwarsaw and interested, we can
>> discuss tomorrow :-)
>>
>
>


Machine Learning: Flink and MOA

2018-02-21 Thread Christophe Salperwyck
Hi guys,

I know there is FlinkML to do some machine learning with Flink, but it
works on DataSet and not on DataStream. There is also SAMOA, which can run
on Flink, but I find it a bit too complicated.

I wanted to see if it would be easy to plug MOA directly into Flink and
tried to present it at the DataKRK meetup, but I didn't have time at the
end of the presentation... Nevertheless, I spent a bit of time plugging
Flink and MOA together, and I thought it might be worth sharing in case it
is interesting for someone. I also take this opportunity to get some
feedback on it from people in the Flink community, if they have a bit of
time to review it.

Here is the code:
https://github.com/csalperwyck/moa-flink-ozabag-example
https://github.com/csalperwyck/moa-flink-traintest-example

Many Flink methods were very convenient for plugging these two tools
together :-)

Keep up the good work!

Cheers,
Christophe
PS: if some people are at bigdatatechwarsaw and interested, we can discuss
tomorrow :-)


Re: Benchmarking streaming frameworks

2017-03-23 Thread Christophe Salperwyck
Good idea! You could test Akka Streams too.

Lots of prior work exists:
https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
https://github.com/yahoo/streaming-benchmarks

Cheers,
Christophe

2017-03-23 11:09 GMT+01:00 Giselle van Dongen :

> Dear users of Streaming Technologies,
>
> As a PhD student in big data analytics, I am currently in the process of
> compiling a list of benchmarks (to test multiple streaming frameworks) in
> order to create an expanded benchmarking suite. The benchmark suite is
> being developed as part of my current work at Ghent University.
>
> The included frameworks at this time are, in no particular order, Spark,
> Flink, Kafka (Streams), Storm (Trident) and Drizzle. Any pointers to
> previous work or relevant benchmarks would be appreciated.
>
> Best regards,
> Giselle van Dongen
>


Re: HBase reads and back pressure

2016-06-14 Thread Christophe Salperwyck
I would need to restart it to be sure (and when it starts getting stuck,
the web interface doesn't report the backpressure anymore), but it seems
so. I used a text file as the output, and it took 5 hours to complete:
aggregates.writeAsText("hdfs:///user/christophe/flinkHBase");

What is weird is that I get as many output lines as input lines if I don't
limit the scan (14B rows). If I limit the scan (150M rows), I get one line
per hour as expected. I am investigating a bit more right now.

Thanks again!
Christophe

2016-06-13 18:50 GMT+02:00 Fabian Hueske :

> Do the backpressure metrics indicate that the sink function is blocking?
>
> 2016-06-13 16:58 GMT+02:00 Christophe Salperwyck <
> christophe.salperw...@gmail.com>:
>
>> To follow up, I implemented the ws.apply(new SummaryStatistics(), new
>> YourFoldFunction(), new YourWindowFunction()) suggestion.
>>
>> It works fine when there is no sink, but when I add an HBase sink, the
>> sink somehow seems to block the flow. The sink writes very little data
>> into HBase, and when I limit my input to just a few sensors, it works
>> well. Any idea?
>>
>> final SingleOutputStreamOperator<Aggregate, ?> aggregates = ws
>>     .apply(
>>         new Aggregate(),
>>         // eagerly folds each element into the Aggregate accumulator
>>         new FoldFunction<ANA, Aggregate>() {
>>
>>             @Override
>>             public Aggregate fold(final Aggregate accumulator, final ANA value)
>>                     throws Exception {
>>                 accumulator.addValue(value.getValue());
>>                 return accumulator;
>>             }
>>         },
>>         // stamps the key and window start on the folded result
>>         new WindowFunction<Aggregate, Aggregate, Tuple, TimeWindow>() {
>>
>>             @Override
>>             public void apply(final Tuple key, final TimeWindow window,
>>                     final Iterable<Aggregate> input,
>>                     final Collector<Aggregate> out) throws Exception {
>>                 for (final Aggregate aggregate : input) {
>>                     aggregate.setM((String) key.getField(0));
>>                     aggregate.setTime(window.getStart());
>>                     out.collect(aggregate);
>>                 }
>>             }
>>         });
>> aggregates
>>     .setParallelism(10)
>>     .writeUsingOutputFormat(new OutputFormat<Aggregate>() {
>>         private static final long serialVersionUID = 1L;
>>         HBaseConnect hBaseConnect;
>>         Table table;
>>         final int flushSize = 100;
>>         List<Put> puts = new ArrayList<>();
>>
>>         @Override
>>         public void writeRecord(final Aggregate record) throws IOException {
>>             puts.add(record.buildPut());
>>             if (puts.size() == flushSize) {
>>                 table.put(puts);
>>                 // clear the buffer, otherwise it keeps growing and the
>>                 // same puts are written again at close()
>>                 puts.clear();
>>             }
>>         }
>>
>>         @Override
>>         public void open(final int taskNumber, final int numTasks)
>>                 throws IOException {
>>             hBaseConnect = new HBaseConnect();
>>             table = hBaseConnect.getHTable("PERF_TEST");
>>         }
>>
>>         @Override
>>         public void configure(final Configuration parameters) {
>>             // nothing to configure
>>         }
>>
>>         @Override
>>         public void close() throws IOException {
>>             // flush the last inserts
>>             table.put(puts);
>>             table.close();
>>             hBaseConnect.close();
>>         }
>>     });
>>
>> 2016-06-13 13:47 GMT+02:00 Maximilian Michels :
>>
>>> Thanks!
>>>
>>> On Mon, Jun 13, 2016 at 12:34 PM, Christophe Salperwyck
>>>  wrote:
>>> > Hi,
>>> > I vote on this issue and I agree this would be nice to have.
>>> >
>>> > Thx!
>>> > Christophe
>>> >
>>> > 2016-06-13 12:26 GMT+02:00 Aljoscha Krettek :
>>> >>
>>> >> Hi,
>>> >> I'm afraid this is currently a shortcoming in the API. There is this open
>>> >> Jira issue to track it: https://issues.apache.org/jira/browse/FLINK-3869.
>>> >> We can't fix it before Flink 2.0, though, because we have to keep the API
>>> >> stable on the Flink 1.x release line.
>>> >>
>>> >> Cheers,
>>> >> Aljoscha
>>> >>
>>> >> On Mon, 13 Jun 2016 at 11:04 Christophe Salperwyck
>>> >>  wrote:
>>> >>>
>>> >>> Thanks for the feedback and sorry that I can't try all this straight
>>> >>> away.
>>> >>>
>>> >>> Is there a way to have a different function than:
>>> >>> WindowFunction<SummaryStatistics, SummaryStatistics, Tuple, TimeWindow>()
>>> >>>
>>> >>> I would like to return an HBase Put and not a SummaryStatistics. So
>>> >>> something like this:
>>> >>> WindowFunction<SummaryStatistics, Put, Tuple, TimeWindow>()
>>> >>>
>>> >>> Christophe
>>> >>>
>>> >>> 2016-06-09 17:47 GMT+02:00 Fabian Hueske :
>>> >>>>
>>> >>>> OK, this indicates that the operator following the source is a
>>> >>>> bottleneck.

Re: HBase reads and back pressure

2016-06-13 Thread Christophe Salperwyck
To follow up, I implemented the ws.apply(new SummaryStatistics(), new
YourFoldFunction(), new YourWindowFunction()) suggestion.

It works fine when there is no sink, but when I add an HBase sink, the sink
somehow seems to block the flow. The sink writes very little data into
HBase, and when I limit my input to just a few sensors, it works well. Any
idea?

final SingleOutputStreamOperator<Aggregate, ?> aggregates = ws
    .apply(
        new Aggregate(),
        // eagerly folds each element into the Aggregate accumulator
        new FoldFunction<ANA, Aggregate>() {

            @Override
            public Aggregate fold(final Aggregate accumulator, final ANA value)
                    throws Exception {
                accumulator.addValue(value.getValue());
                return accumulator;
            }
        },
        // stamps the key and window start on the folded result
        new WindowFunction<Aggregate, Aggregate, Tuple, TimeWindow>() {

            @Override
            public void apply(final Tuple key, final TimeWindow window,
                    final Iterable<Aggregate> input,
                    final Collector<Aggregate> out) throws Exception {
                for (final Aggregate aggregate : input) {
                    aggregate.setM((String) key.getField(0));
                    aggregate.setTime(window.getStart());
                    out.collect(aggregate);
                }
            }
        });
aggregates
    .setParallelism(10)
    .writeUsingOutputFormat(new OutputFormat<Aggregate>() {
        private static final long serialVersionUID = 1L;
        HBaseConnect hBaseConnect;
        Table table;
        final int flushSize = 100;
        List<Put> puts = new ArrayList<>();

        @Override
        public void writeRecord(final Aggregate record) throws IOException {
            puts.add(record.buildPut());
            if (puts.size() == flushSize) {
                table.put(puts);
                // clear the buffer, otherwise it keeps growing and the
                // same puts are written again at close()
                puts.clear();
            }
        }

        @Override
        public void open(final int taskNumber, final int numTasks)
                throws IOException {
            hBaseConnect = new HBaseConnect();
            table = hBaseConnect.getHTable("PERF_TEST");
        }

        @Override
        public void configure(final Configuration parameters) {
            // nothing to configure
        }

        @Override
        public void close() throws IOException {
            // flush the last inserts
            table.put(puts);
            table.close();
            hBaseConnect.close();
        }
    });

2016-06-13 13:47 GMT+02:00 Maximilian Michels :

> Thanks!
>
> On Mon, Jun 13, 2016 at 12:34 PM, Christophe Salperwyck
>  wrote:
> > Hi,
> > I vote on this issue and I agree this would be nice to have.
> >
> > Thx!
> > Christophe
> >
> > 2016-06-13 12:26 GMT+02:00 Aljoscha Krettek :
> >>
> >> Hi,
> >> I'm afraid this is currently a shortcoming in the API. There is this open
> >> Jira issue to track it: https://issues.apache.org/jira/browse/FLINK-3869.
> >> We can't fix it before Flink 2.0, though, because we have to keep the API
> >> stable on the Flink 1.x release line.
> >>
> >> Cheers,
> >> Aljoscha
> >>
> >> On Mon, 13 Jun 2016 at 11:04 Christophe Salperwyck
> >>  wrote:
> >>>
> >>> Thanks for the feedback and sorry that I can't try all this straight
> >>> away.
> >>>
> >>> Is there a way to have a different function than:
> >>> WindowFunction<SummaryStatistics, SummaryStatistics, Tuple, TimeWindow>()
> >>>
> >>> I would like to return an HBase Put and not a SummaryStatistics. So
> >>> something like this:
> >>> WindowFunction<SummaryStatistics, Put, Tuple, TimeWindow>()
> >>>
> >>> Christophe
> >>>
> >>> 2016-06-09 17:47 GMT+02:00 Fabian Hueske :
> >>>>
> >>>> OK, this indicates that the operator following the source is a
> >>>> bottleneck.
> >>>>
> >>>> If that's the WindowOperator, it makes sense to try the refactoring of
> >>>> the WindowFunction.
> >>>> Alternatively, you can try to run that operator with a higher
> >>>> parallelism.
> >>>>
> >>>> 2016-06-09 17:39 GMT+02:00 Christophe Salperwyck
> >>>> :
> >>>>>
> >>>>> Hi Fabian,
> >>>>>
> >>>>> Thanks for the help, I will try that. The backpressure was on the
> >>>>> source (HBase).
> >>>>>
> >>>>> Christophe
> >>>>>
> >>>>> 2016-06-09 16:38 GMT+02:00 Fabian Hueske :
> >>>>>>
> >>>>>> Hi Christophe,
> >>>>>>
> >>>>>> where does the backpressure appear? In front of the sink operator or
> >>>>>> before the window operator?
> >>>>>>
> >>>>>> In any case, I think you can improve your WindowFunction if you
> >>>>>> convert parts of it into a FoldFunction.
> >>>>>> The FoldFunction would take care of the statistics computation and
> the
> >>>>>> WindowFunction would only assemble the result record including
> extracting
> >>>>>> the start time of the window.
> >>>>>>
> >>>>>> Then you could do:
> >>>>>>
> >>>>>> ws.apply(new SummaryStatistics(), new YourFoldFunction(), new
> >>>>>> YourWindowFunction());

Re: HBase reads and back pressure

2016-06-13 Thread Christophe Salperwyck
Hi,
I voted for this issue, and I agree this would be nice to have.

Thx!
Christophe

2016-06-13 12:26 GMT+02:00 Aljoscha Krettek :

> Hi,
> I'm afraid this is currently a shortcoming in the API. There is this open
> Jira issue to track it: https://issues.apache.org/jira/browse/FLINK-3869.
> We can't fix it before Flink 2.0, though, because we have to keep the API
> stable on the Flink 1.x release line.
>
> Cheers,
> Aljoscha
>
> On Mon, 13 Jun 2016 at 11:04 Christophe Salperwyck <
> christophe.salperw...@gmail.com> wrote:
>
>> Thanks for the feedback and sorry that I can't try all this straight away.
>>
>> Is there a way to have a different function than:
>> WindowFunction<SummaryStatistics, SummaryStatistics, Tuple, TimeWindow>()
>>
>> I would like to return an HBase Put and not a SummaryStatistics. So
>> something like this:
>> WindowFunction<SummaryStatistics, Put, Tuple, TimeWindow>()
>>
>> Christophe
>>
>> 2016-06-09 17:47 GMT+02:00 Fabian Hueske :
>>
>>> OK, this indicates that the operator following the source is a
>>> bottleneck.
>>>
>>> If that's the WindowOperator, it makes sense to try the refactoring of
>>> the WindowFunction.
>>> Alternatively, you can try to run that operator with a higher
>>> parallelism.
>>>
>>> 2016-06-09 17:39 GMT+02:00 Christophe Salperwyck <
>>> christophe.salperw...@gmail.com>:
>>>
>>>> Hi Fabian,
>>>>
>>>> Thanks for the help, I will try that. The backpressure was on the
>>>> source (HBase).
>>>>
>>>> Christophe
>>>>
>>>> 2016-06-09 16:38 GMT+02:00 Fabian Hueske :
>>>>
>>>>> Hi Christophe,
>>>>>
>>>>> where does the backpressure appear? In front of the sink operator or
>>>>> before the window operator?
>>>>>
>>>>> In any case, I think you can improve your WindowFunction if you
>>>>> convert parts of it into a FoldFunction.
>>>>> The FoldFunction would take care of the statistics computation and the
>>>>> WindowFunction would only assemble the result record including extracting
>>>>> the start time of the window.
>>>>>
>>>>> Then you could do:
>>>>>
>>>>> ws.apply(new SummaryStatistics(), new YourFoldFunction(), new
>>>>> YourWindowFunction());
>>>>>
>>> >>>>> This is more efficient because the FoldFunction is eagerly applied
>>> >>>>> whenever a new element is added to a window. Hence, the window only
>>> >>>>> holds a single value (the SummaryStatistics) instead of all elements
>>> >>>>> added to the window. In contrast, the WindowFunction is called when
>>> >>>>> the window is finally evaluated.
>>>>>
>>>>> Hope this helps,
>>>>> Fabian
>>>>>
>>>>> 2016-06-09 14:53 GMT+02:00 Christophe Salperwyck <
>>>>> christophe.salperw...@gmail.com>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am writing a program to read timeseries from HBase and do some
>>>>>> daily aggregations (Flink streaming). For now I am just computing some
>>>>>> average so not very consuming but my HBase read get slower and slower (I
>>>>>> have few billions of points to read). The back pressure is almost all the
>>>>>> time close to 1.
>>>>>>
>>>>>> I use custom timestamp:
>>>>>> env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
>>>>>>
>>>>>> so I implemented a custom extractor based on:
>>>>>> AscendingTimestampExtractor
>>>>>>
>>>>>> At the beginning I have 5M reads/s and after 15 min I have just 1M
>>>>>> read/s then it get worse and worse. Even when I cancel the job, data are
>>>>>> still being written in HBase (I did a sink similar to the example - with 
>>>>>> a
>>>>>> cache of 100s of HBase Puts to be a bit more efficient).
>>>>>>
>>>>>> When I don't put a sink it seems to stay on 1M reads/s.
>>>>>>
>>>>>> Do you have an idea why ?
>>>>>>
>>>>>> Here is a bit of code if needed:
>>>>>> final WindowedStream ws = hbaseDS.keyBy(0)
>>>>>> .assignTimestampsAndWatermarks(new AscendingTimestampExtractor())
>>>>>> .keyBy(0)
>>>>>> .timeWindow(Time.days(1));
>>>>>>
>>>>>> final SingleOutputStreamOperator puts = ws.apply(new
>>>>>> WindowFunction() {
>>>>>>
>>>>>> @Override
>>>>>> public void apply(final Tuple key, final TimeWindow window, final
>>>>>> Iterable input,
>>>>>> final Collector out) throws Exception {
>>>>>>
>>>>>> final SummaryStatistics summaryStatistics = new SummaryStatistics();
>>>>>> for (final ANA ana : input) {
>>>>>> summaryStatistics.addValue(ana.getValue());
>>>>>> }
>>>>>> final Put put = buildPut((String) key.getField(0), window.getStart(),
>>>>>> summaryStatistics);
>>>>>> out.collect(put);
>>>>>> }
>>>>>> });
>>>>>>
>>>>>> And how I started Flink on YARN :
>>>>>> flink-1.0.3/bin/yarn-session.sh -n 20 -tm 16384 -s 2
>>>>>> -Dtaskmanager.network.numberOfBuffers=4096
>>>>>>
>>>>>> Thanks for any feedback!
>>>>>>
>>>>>> Christophe
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>


Re: HBase reads and back pressure

2016-06-13 Thread Christophe Salperwyck
Hi Max,

In fact, the Put would be the output of my WindowFunction. I saw Aljoscha
replied; it seems I will need to create another intermediate class to
handle that. But it is fine.
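
Something like this, roughly (the class name mirrors what I use in my job;
the column layout in buildPut() is just illustrative):

import java.io.Serializable;

import org.apache.commons.math3.stat.descriptive.SummaryStatistics;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Intermediate value: the FoldFunction accumulates into it, the
// WindowFunction stamps the key and window start, and the sink finally
// turns it into an HBase Put.
public class Aggregate implements Serializable {

    private final SummaryStatistics stats = new SummaryStatistics();
    private String m;    // the sensor/metric key
    private long time;   // the window start

    public void addValue(final double value) { stats.addValue(value); }
    public void setM(final String m) { this.m = m; }
    public void setTime(final long time) { this.time = time; }

    public Put buildPut() {
        final Put put = new Put(Bytes.toBytes(m + ":" + time));
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("mean"),
                Bytes.toBytes(stats.getMean()));
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("count"),
                Bytes.toBytes(stats.getN()));
        return put;
    }
}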

Thx for help!
Christophe

2016-06-13 12:25 GMT+02:00 Maximilian Michels :

> Hi Christophe,
>
> A fold function has two inputs: The state and a record to update the
> state with. So you can update the SummaryStatistics (state) with each
> Put (input).
>
> Cheers,
> Max
>
> On Mon, Jun 13, 2016 at 11:04 AM, Christophe Salperwyck
>  wrote:
> > Thanks for the feedback and sorry that I can't try all this straight
> away.
> >
> > Is there a way to have a different function than:
> > WindowFunction<SummaryStatistics, SummaryStatistics, Tuple, TimeWindow>()
> >
> > I would like to return an HBase Put and not a SummaryStatistics. So
> > something like this:
> > WindowFunction<SummaryStatistics, Put, Tuple, TimeWindow>()
> >
> > Christophe
> >
> > 2016-06-09 17:47 GMT+02:00 Fabian Hueske :
> >>
> >> OK, this indicates that the operator following the source is a
> bottleneck.
> >>
> >> If that's the WindowOperator, it makes sense to try the refactoring of
> the
> >> WindowFunction.
> >> Alternatively, you can try to run that operator with a higher
> parallelism.
> >>
> >> 2016-06-09 17:39 GMT+02:00 Christophe Salperwyck
> >> :
> >>>
> >>> Hi Fabian,
> >>>
> >>> Thanks for the help, I will try that. The backpressure was on the
> source
> >>> (HBase).
> >>>
> >>> Christophe
> >>>
> >>> 2016-06-09 16:38 GMT+02:00 Fabian Hueske :
> >>>>
> >>>> Hi Christophe,
> >>>>
> >>>> where does the backpressure appear? In front of the sink operator or
> >>>> before the window operator?
> >>>>
> >>>> In any case, I think you can improve your WindowFunction if you
> convert
> >>>> parts of it into a FoldFunction.
> >>>> The FoldFunction would take care of the statistics computation and the
> >>>> WindowFunction would only assemble the result record including
> extracting
> >>>> the start time of the window.
> >>>>
> >>>> Then you could do:
> >>>>
> >>>> ws.apply(new SummaryStatistics(), new YourFoldFunction(), new
> >>>> YourWindowFunction());
> >>>>
> >>>> This is more efficient because the FoldFunction is eagerly applied
> >>>> whenever a new element is added to a window. Hence, the window only
> >>>> holds a single value (the SummaryStatistics) instead of all elements
> >>>> added to the window. In contrast, the WindowFunction is called when
> >>>> the window is finally evaluated.
> >>>>
> >>>> Hope this helps,
> >>>> Fabian
> >>>>
> >>>> 2016-06-09 14:53 GMT+02:00 Christophe Salperwyck
> >>>> :
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am writing a program to read timeseries from HBase and do some
> daily
> >>>>> aggregations (Flink streaming). For now I am just computing some
> average so
> >>>>> not very consuming but my HBase read get slower and slower (I have
> few
> >>>>> billions of points to read). The back pressure is almost all the
> time close
> >>>>> to 1.
> >>>>>
> >>>>> I use custom timestamp:
> >>>>> env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
> >>>>>
> >>>>> so I implemented a custom extractor based on:
> >>>>> AscendingTimestampExtractor
> >>>>>
> >>>>> At the beginning I have 5M reads/s and after 15 min I have just 1M
> >>>>> read/s then it get worse and worse. Even when I cancel the job, data
> are
> >>>>> still being written in HBase (I did a sink similar to the example -
> with a
> >>>>> cache of 100s of HBase Puts to be a bit more efficient).
> >>>>>
> >>>>> When I don't put a sink it seems to stay on 1M reads/s.
> >>>>>
> >>>>> Do you have an idea why ?
> >>>>>
> >>>>> Here is a bit of code if needed:
> >>>>> final WindowedStream ws = hbaseDS.keyBy(0)
> >>>>> .assignTimestampsAndWatermarks(new AscendingTimestampExtractor())
> >>>>> .keyBy(0)
> >>>>> .timeWindow(Time.days(1));
> >>>>>
> >>>>> final SingleOutputStreamOperator puts = ws.apply(new
> >>>>> WindowFunction() {
> >>>>>
> >>>>> @Override
> >>>>> public void apply(final Tuple key, final TimeWindow window, final
> >>>>> Iterable input,
> >>>>> final Collector out) throws Exception {
> >>>>>
> >>>>> final SummaryStatistics summaryStatistics = new SummaryStatistics();
> >>>>> for (final ANA ana : input) {
> >>>>> summaryStatistics.addValue(ana.getValue());
> >>>>> }
> >>>>> final Put put = buildPut((String) key.getField(0), window.getStart(),
> >>>>> summaryStatistics);
> >>>>> out.collect(put);
> >>>>> }
> >>>>> });
> >>>>>
> >>>>> And how I started Flink on YARN :
> >>>>> flink-1.0.3/bin/yarn-session.sh -n 20 -tm 16384 -s 2
> >>>>> -Dtaskmanager.network.numberOfBuffers=4096
> >>>>>
> >>>>> Thanks for any feedback!
> >>>>>
> >>>>> Christophe
> >>>>
> >>>>
> >>>
> >>
> >
>


Re: HBase reads and back pressure

2016-06-13 Thread Christophe Salperwyck
Thanks for the feedback and sorry that I can't try all this straight away.

Is there a way to have a different function than:
WindowFunction<SummaryStatistics, SummaryStatistics, Tuple, TimeWindow>()

I would like to return an HBase Put and not a SummaryStatistics. So
something like this:
WindowFunction<SummaryStatistics, Put, Tuple, TimeWindow>()

Christophe

2016-06-09 17:47 GMT+02:00 Fabian Hueske :

> OK, this indicates that the operator following the source is a bottleneck.
>
> If that's the WindowOperator, it makes sense to try the refactoring of the
> WindowFunction.
> Alternatively, you can try to run that operator with a higher parallelism.
>
> 2016-06-09 17:39 GMT+02:00 Christophe Salperwyck <
> christophe.salperw...@gmail.com>:
>
>> Hi Fabian,
>>
>> Thanks for the help, I will try that. The backpressure was on the source
>> (HBase).
>>
>> Christophe
>>
>> 2016-06-09 16:38 GMT+02:00 Fabian Hueske :
>>
>>> Hi Christophe,
>>>
>>> where does the backpressure appear? In front of the sink operator or
>>> before the window operator?
>>>
>>> In any case, I think you can improve your WindowFunction if you convert
>>> parts of it into a FoldFunction.
>>> The FoldFunction would take care of the statistics computation and the
>>> WindowFunction would only assemble the result record including extracting
>>> the start time of the window.
>>>
>>> Then you could do:
>>>
>>> ws.apply(new SummaryStatistics(), new YourFoldFunction(), new
>>> YourWindowFunction());
>>>
>>> This is more efficient because the FoldFunction is eagerly applied
>>> whenever a new element is added to a window. Hence, the window only
>>> holds a single value (the SummaryStatistics) instead of all elements
>>> added to the window. In contrast, the WindowFunction is called when the
>>> window is finally evaluated.
>>>
>>> Hope this helps,
>>> Fabian
>>>
>>> 2016-06-09 14:53 GMT+02:00 Christophe Salperwyck <
>>> christophe.salperw...@gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> I am writing a program to read timeseries from HBase and do some daily
>>>> aggregations (Flink streaming). For now I am just computing some average so
>>>> not very consuming but my HBase read get slower and slower (I have few
>>>> billions of points to read). The back pressure is almost all the time close
>>>> to 1.
>>>>
>>>> I use custom timestamp:
>>>> env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
>>>>
>>>> so I implemented a custom extractor based on:
>>>> AscendingTimestampExtractor
>>>>
>>>> At the beginning I have 5M reads/s and after 15 min I have just 1M
>>>> read/s then it get worse and worse. Even when I cancel the job, data are
>>>> still being written in HBase (I did a sink similar to the example - with a
>>>> cache of 100s of HBase Puts to be a bit more efficient).
>>>>
>>>> When I don't put a sink it seems to stay on 1M reads/s.
>>>>
>>>> Do you have an idea why ?
>>>>
>>>> Here is a bit of code if needed:
>>>> final WindowedStream ws = hbaseDS.keyBy(0)
>>>> .assignTimestampsAndWatermarks(new AscendingTimestampExtractor())
>>>> .keyBy(0)
>>>> .timeWindow(Time.days(1));
>>>>
>>>> final SingleOutputStreamOperator puts = ws.apply(new
>>>> WindowFunction() {
>>>>
>>>> @Override
>>>> public void apply(final Tuple key, final TimeWindow window, final
>>>> Iterable input,
>>>> final Collector out) throws Exception {
>>>>
>>>> final SummaryStatistics summaryStatistics = new SummaryStatistics();
>>>> for (final ANA ana : input) {
>>>> summaryStatistics.addValue(ana.getValue());
>>>> }
>>>> final Put put = buildPut((String) key.getField(0), window.getStart(),
>>>> summaryStatistics);
>>>> out.collect(put);
>>>> }
>>>> });
>>>>
>>>> And how I started Flink on YARN :
>>>> flink-1.0.3/bin/yarn-session.sh -n 20 -tm 16384 -s 2
>>>> -Dtaskmanager.network.numberOfBuffers=4096
>>>>
>>>> Thanks for any feedback!
>>>>
>>>> Christophe
>>>>
>>>
>>>
>>
>


Re: HBase reads and back pressure

2016-06-09 Thread Christophe Salperwyck
Hi Fabian,

Thanks for the help, I will try that. The backpressure was on the source
(HBase).

Christophe

2016-06-09 16:38 GMT+02:00 Fabian Hueske :

> Hi Christophe,
>
> where does the backpressure appear? In front of the sink operator or
> before the window operator?
>
> In any case, I think you can improve your WindowFunction if you convert
> parts of it into a FoldFunction.
> The FoldFunction would take care of the statistics computation and the
> WindowFunction would only assemble the result record including extracting
> the start time of the window.
>
> Then you could do:
>
> ws.apply(new SummaryStatistics(), new YourFoldFunction(), new
> YourWindowFunction());
>
> This is more efficient because the FoldFunction is eagerly applied
> whenever a new element is added to a window. Hence, the window only holds
> a single value (the SummaryStatistics) instead of all elements added to
> the window. In contrast, the WindowFunction is called when the window is
> finally evaluated.
>
> Hope this helps,
> Fabian
>
> 2016-06-09 14:53 GMT+02:00 Christophe Salperwyck <
> christophe.salperw...@gmail.com>:
>
>> Hi,
>>
>> I am writing a program to read timeseries from HBase and do some daily
>> aggregations (Flink streaming). For now I am just computing some average so
>> not very consuming but my HBase read get slower and slower (I have few
>> billions of points to read). The back pressure is almost all the time close
>> to 1.
>>
>> I use custom timestamp:
>> env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
>>
>> so I implemented a custom extractor based on:
>> AscendingTimestampExtractor
>>
>> At the beginning I have 5M reads/s and after 15 min I have just 1M read/s
>> then it get worse and worse. Even when I cancel the job, data are still
>> being written in HBase (I did a sink similar to the example - with a cache
>> of 100s of HBase Puts to be a bit more efficient).
>>
>> When I don't put a sink it seems to stay on 1M reads/s.
>>
>> Do you have an idea why ?
>>
>> Here is a bit of code if needed:
>> final WindowedStream ws = hbaseDS.keyBy(0)
>> .assignTimestampsAndWatermarks(new AscendingTimestampExtractor())
>> .keyBy(0)
>> .timeWindow(Time.days(1));
>>
>> final SingleOutputStreamOperator puts = ws.apply(new
>> WindowFunction() {
>>
>> @Override
>> public void apply(final Tuple key, final TimeWindow window, final
>> Iterable input,
>> final Collector out) throws Exception {
>>
>> final SummaryStatistics summaryStatistics = new SummaryStatistics();
>> for (final ANA ana : input) {
>> summaryStatistics.addValue(ana.getValue());
>> }
>> final Put put = buildPut((String) key.getField(0), window.getStart(),
>> summaryStatistics);
>> out.collect(put);
>> }
>> });
>>
>> And how I started Flink on YARN :
>> flink-1.0.3/bin/yarn-session.sh -n 20 -tm 16384 -s 2
>> -Dtaskmanager.network.numberOfBuffers=4096
>>
>> Thanks for any feedback!
>>
>> Christophe
>>
>
>


Re: Hourly top-k statistics of DataStream

2016-06-09 Thread Christophe Salperwyck
Hi,

There are some implementations of that with a low memory footprint. Have a
look at the count-min sketch, for example. There are some Java
implementations.
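
The idea is just a 2-D array of counters indexed by several hash functions;
the estimate can only over-count, never under-count. A tiny self-contained
sketch (illustrative hashing only, not production code):

import java.util.Random;

// Minimal count-min sketch for frequency estimation over a stream.
public class CountMin {

    private final long[][] counts;
    private final int[] seeds;
    private final int width;

    public CountMin(final int depth, final int width) {
        this.counts = new long[depth][width];
        this.seeds = new int[depth];
        this.width = width;
        final Random rnd = new Random(42);
        for (int i = 0; i < depth; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    private int bucket(final String item, final int row) {
        // weak hash family for illustration; use pairwise-independent
        // hashes in practice
        return Math.abs((item.hashCode() ^ seeds[row]) % width);
    }

    public void add(final String item) {
        for (int row = 0; row < counts.length; row++) {
            counts[row][bucket(item, row)]++;
        }
    }

    public long estimate(final String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < counts.length; row++) {
            min = Math.min(min, counts[row][bucket(item, row)]);
        }
        return min;
    }
}

For top-k you would combine it with a small heap of current heavy-hitter
candidates, so you never keep the full key space in memory.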

Christophe

2016-06-09 15:29 GMT+02:00 Yukun Guo :

> Thank you very much for the detailed answer. Now I understand a DataStream
> can be repartitioned or “joined” (don’t know the exact terminology) with
> keyBy.
>
> But another question:
> Despite the non-existence of incremental top-k algorithm, I’d like to
> incrementally compute the local word count during one hour, probably using
> a TreeMap for counting. As soon as the hour finishes, the TreeMap is
> converted to a stream of Tuple2 and forwarded to the remaining computation
> thereafter. I’m concerned about the memory usage: the TreeMap and the
> Tuple2 collection hold a huge amount of items, do I have to do some custom
> memory management?
>
> I’m also not sure whether a TreeMap is suitable here. This StackOverflow
> question presents a similar approach:
> http://stackoverflow.com/questions/34681887/how-apache-flink-deal-with-skewed-data,
> but the suggested solution seems rather complicated.
>
> On 8 June 2016 at 08:04, Jamie Grier  wrote:
>
>> Suggestions in-line below...
>>
>> On Mon, Jun 6, 2016 at 7:26 PM, Yukun Guo  wrote:
>>
>>> Hi,
>>>
>>> I'm working on a project which uses Flink to compute hourly log
>>> statistics like top-K. The logs are fetched from Kafka by a
>>> FlinkKafkaConsumer and packed into a DataStream.
>>>
>>> The problem is, I find the computation quite challenging to express with
>>> Flink's DataStream API:
>>>
>>> 1. If I use something like `logs.timeWindow(Time.hours(1))`, suppose
>>> that the
>>> data volume is really high, e.g., billions of logs might be generated in
>>> one
>>> hour, will the window grow too large and can't be handled efficiently?
>>>
>>
>> In the general case you can use:
>>
>> stream
>> .timeWindow(...)
>> .apply(reduceFunction, windowFunction)
>>
>> which can take a ReduceFunction and a WindowFunction.  The ReduceFunction
>> is used to reduce the state on the fly and thereby keep the total state
>> size low.  This can commonly be used in analytics applications to reduce
>> the state size that you're accumulating for each window.  In the specific
>> case of TopK, however, you cannot do this if you want an exact result.  To
>> get an exact result I believe you have to actually keep around all of the
>> data and then calculate TopK at the end in your WindowFunction.  If you are
>> able to use approximate algorithms for your use case then you can calculate
>> a probabilistic incremental TopK based on some sort of sketch-based
>> algorithm.
>>
>>>
>>> 2. We have to create a `KeyedStream` before applying `timeWindow`.
>>> However,
>>> the distribution of some keys are skewed hence using them may compromise
>>> the performance due to unbalanced partition loads. (What I want is just
>>> rebalance the stream across all partitions.)
>>>
>>
>> A good and simple way to approach this may be to come up with a composite
>> key for your data that *is* uniformly distributed.  You can imagine
>> something simple like 'natural_key:random_number'.  Then keyBy(natural_key)
>> and reduce() again.  For example:
>>
>> stream
>> .keyBy(key, rand())  // partition by composite key that is
>> uniformly distributed
>> .timeWindow(1 hour)
>> .reduce() // pre-aggregation
>> .keyBy(key)// repartition
>> .timeWindow(1 hour)
>> .reduce() // final aggregation
>>
>>
>>>
>>> 3. The top-K algorithm can be straightforwardly implemented with
>>> `DataSet`'s
>>> `mapPartition` and `reduceGroup` API as in
>>> [FLINK-2549](https://github.com/apache/flink/pull/1161/), but not so
>>> easy if
>>> taking the DataStream approach, even with the stateful operators. I still
>>> cannot figure out how to reunion streams once they are partitioned.
>>>
>>> I'm not sure I know what you're trying to do here.  What do you mean
>> by re-union?
>>
>>
>>> 4. Is it possible to convert a DataStream into a DataSet? If yes, how
>>> can I
>>> make Flink analyze the data incrementally rather than aggregating the
>>> logs for
>>> one hour before starting to process?
>>>
>>> There is no direct way to turn a DataStream into a DataSet.  I addressed
>> the point about doing the computation incrementally above, though.  You do
>> this with a ReduceFunction.  But again, there doesn't exist an exact
>> incremental TopK algorithm that I'm aware of.  This can be done with
>> sketching, though.
>>
>>
>> --
>>
>> Jamie Grier
>> data Artisans, Director of Applications Engineering
>> @jamiegrier 
>> ja...@data-artisans.com
>>
>>
>


HBase reads and back pressure

2016-06-09 Thread Christophe Salperwyck
Hi,

I am writing a program to read time series from HBase and do some daily
aggregations (Flink streaming). For now I am just computing some averages,
so nothing very heavy, but my HBase reads get slower and slower (I have a
few billion points to read). The back pressure is close to 1 almost all
the time.

I use custom timestamps:
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

so I implemented a custom extractor based on
AscendingTimestampExtractor.
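
For reference, the extractor is roughly this (assuming ANA exposes its
event time via a getter; the exact package and method to override depend on
the Flink version, in 1.0 it also receives the current timestamp):

import org.apache.flink.streaming.api.functions.AscendingTimestampExtractor;

public class ANATimestampExtractor extends AscendingTimestampExtractor<ANA> {
    private static final long serialVersionUID = 1L;

    @Override
    public long extractAscendingTimestamp(final ANA element, final long currentTimestamp) {
        // rows come out of the HBase scan in timestamp order,
        // so event time is ascending per key
        return element.getTimestamp();
    }
}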

At the beginning I get 5M reads/s, and after 15 min just 1M reads/s; then
it gets worse and worse. Even when I cancel the job, data is still being
written to HBase (I built a sink similar to the example, with a buffer of
hundreds of HBase Puts to be a bit more efficient).

When I don't add a sink, it seems to stay at 1M reads/s.

Do you have an idea why?

Here is a bit of code if needed:
final WindowedStream<ANA, Tuple, TimeWindow> ws = hbaseDS.keyBy(0)
    .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<ANA>())
    .keyBy(0)
    .timeWindow(Time.days(1));

final SingleOutputStreamOperator<Put, ?> puts = ws.apply(
    new WindowFunction<ANA, Put, Tuple, TimeWindow>() {

        @Override
        public void apply(final Tuple key, final TimeWindow window,
                final Iterable<ANA> input,
                final Collector<Put> out) throws Exception {

            // aggregate all values of the window into one statistics object
            final SummaryStatistics summaryStatistics = new SummaryStatistics();
            for (final ANA ana : input) {
                summaryStatistics.addValue(ana.getValue());
            }
            final Put put = buildPut((String) key.getField(0),
                    window.getStart(), summaryStatistics);
            out.collect(put);
        }
    });

And how I started Flink on YARN :
flink-1.0.3/bin/yarn-session.sh -n 20 -tm 16384 -s 2
-Dtaskmanager.network.numberOfBuffers=4096

Thanks for any feedback!

Christophe


Re: HBase Input Format for streaming

2016-06-06 Thread Christophe Salperwyck
I just did that:

public T nextRecord(final T reuse) throws IOException {
    if (this.rs == null) {
        // throw new IOException("No table result scanner provided!");
        return null;
    }
    ...

because in the class FileSourceFunction we have:

@Override
public void run(SourceContext<OUT> ctx) throws Exception {
    while (isRunning) {
        OUT nextElement = serializer.createInstance();
        nextElement = format.nextRecord(nextElement);
        if (nextElement == null && splitIterator.hasNext()) {
            format.open(splitIterator.next());
            continue;
        } else if (nextElement == null) {
            break;
        }
        ctx.collect(nextElement);
    }
}

(I had to copy TableInputSplit as its constructor is not visible...)


2016-06-06 16:07 GMT+02:00 Ufuk Celebi :

> From the code it looks like the open method of the TableInputFormat is
> never called. What are you doing differently in the
> StreamingTableInputFormat?
>
> – Ufuk
>
>
> On Mon, Jun 6, 2016 at 1:49 PM, Christophe Salperwyck
>  wrote:
> > Hi all,
> >
> > I am trying to read data from HBase and use the windows functions of
> Flink
> > streaming. I can read my data using the ExecutionEnvironment but not from
> > the StreamExecutionEnvironment.
> >
> > Is that a known issue?
> >
> > Are the inputsplits used in the streaming environment?
> >
> > Here a sample of my code:
> >
> > final StreamExecutionEnvironment env =
> > StreamExecutionEnvironment.getExecutionEnvironment();
> > env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
> >
> > @SuppressWarnings("serial")
> > final DataStreamSource anaDS = env.createInput(new
> > TableInputFormat() {
> > ...
> > }
> >
> > final WindowedStream ws = anaDS.
> > assignTimestampsAndWatermarks(new
> AssignerWithPunctuatedWatermarks()).
> > keyBy(0).
> > timeWindow(Time.days(30), Time.days(30));
> >
> > ws.sum(2).printToErr();
> > env.execute();
> >
> > The error I get is:
> > Caused by: java.io.IOException: No table result scanner provided!
> > at
> >
> org.apache.flink.addons.hbase.TableInputFormat.nextRecord(TableInputFormat.java:103)
> >
> > It seems the "Result" is not read for a first time before calling this
> > function.
> >
> > I built a "StreamingTableInputFormat" as a temporary work around but let
> me
> > know if there is something I did wrong.
> >
> > Thanks for everything, Flink is great!
> >
> > Cheers,
> > Christophe
>


HBase Input Format for streaming

2016-06-06 Thread Christophe Salperwyck
Hi all,

I am trying to read data from HBase and use the window functions of Flink
streaming. I can read my data using the ExecutionEnvironment but not from
the StreamExecutionEnvironment.

Is that a known issue?

Are the input splits used in the streaming environment?

Here is a sample of my code:

final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

@SuppressWarnings("serial")
final DataStreamSource anaDS = env.createInput(new TableInputFormat() {
    ...
});

final WindowedStream ws = anaDS
    .assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks())
    .keyBy(0)
    .timeWindow(Time.days(30), Time.days(30));

ws.sum(2).printToErr();
env.execute();

The error I get is:
Caused by: java.io.IOException: No table result scanner provided!
at org.apache.flink.addons.hbase.TableInputFormat.nextRecord(TableInputFormat.java:103)

It seems the "Result" is not read a first time before this function is
called.

I built a "StreamingTableInputFormat" as a temporary workaround, but let me
know if there is something I did wrong.

Thanks for everything, Flink is great!

Cheers,
Christophe


unsubscribe

2016-05-20 Thread Christophe Salperwyck



Re: Possible use case: Simulating iterative batch processing by rewinding source

2016-04-06 Thread Christophe Salperwyck
Hi,

I am interested too. For my part, I was thinking of using HBase as a
backend so that my data is stored sorted. That is nice for generating time
series in the right order.

Cheers,
Christophe

2016-04-06 21:22 GMT+02:00 Raul Kripalani :

> Hello,
>
> I'm getting started with Flink for a use case that could leverage the
> window processing abilities of Flink that Spark does not offer.
>
> Basically I have dumps of time series data (10 years in ticks) from which
> I need to calculate many metrics in an exploratory manner based on event
> time. NOTE: I don't have the metrics beforehand; it's going to be an
> exploratory and iterative data analytics effort.
>
> Flink doesn't seem to support windows in batch processing, so I'm thinking
> about emulating batch by using the Kafka stream connector and rewinding the
> data stream for every new metric that I calculate, to process the full
> time series in a batch.
>
> Each metric I calculate should in turn be sent to another Kafka topic so I
> can use it in a subsequent processing batch, e.g.
>
> Iteration 1)   raw timeseries data ---> metric1
> Iteration 2)   raw timeseries data + metric1 (composite) ---> metric2
> Iteration 3)   metric1 + metric2 ---> metric3
> Iteration 4)   raw timeseries data + metric3 ---> metric4
> ...
>
> Does this sound like a use case for Flink? Could you guide me a little bit
> on whether this is feasible currently?
>
> Cheers,
>
> *Raúl Kripalani*
> PMC & Committer @ Apache Ignite, Apache Camel | Integration, Big Data and
> Messaging Engineer
> http://about.me/raulkripalani | http://www.linkedin.com/in/raulkripalani
> Blog: raul.io | Twitter: @raulvk
>


Re: Running Flink jobs directly from Eclipse

2016-04-06 Thread Christophe Salperwyck
I exported it as an environment variable before starting Flink:
flink-0.10.1/bin/yarn-session.sh -n 64 -s 4 -jm 4096 -tm 4096

2016-04-06 15:36 GMT+02:00 Serhiy Boychenko :

> What about the YARN (and HDFS) configuration? Do I put yarn-site.xml
> directly on the classpath? Or can I set the variables in the execution
> environment? I will give it a try tomorrow morning, will report back, and
> if successful, blog about it of course :-)
>
>
>
> *From:* Christophe Salperwyck [mailto:christophe.salperw...@gmail.com]
> *Sent:* 06 April 2016 13:41
> *To:* user@flink.apache.org
> *Subject:* Re: Running Flink jobs directly from Eclipse
>
>
>
For me, it took the local jar and uploaded it to the cluster.
>
>
>
> 2016-04-06 13:16 GMT+02:00 Shannon Carey :
>
> Thanks for the info! It is a bit difficult to tell based on the
> documentation whether or not you need to put your jar onto the Flink master
> node and run the flink command from there in order to get a job running.
> The documentation on
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/yarn_setup.html
>  isn't
> very explicit about where you can run the flink command from, and doesn't
> mention that you can run the job programmatically instead of using the CLI.
>
>
>
> *From: *Christophe Salperwyck 
> *Date: *Wednesday, April 6, 2016 at 1:24 PM
> *To: *
> *Subject: *Re: Running Flink jobs directly from Eclipse
>
>
>
> From my side I was starting the YARN session from the cluster:
>
> flink-0.10.1/bin/yarn-session.sh -n 64 -s 4 -jm 4096 -tm 4096
>
>
>
> Then getting the IP/port from the WebUI and then from Eclipse:
>
> ExecutionEnvironment env =
> ExecutionEnvironment.createRemoteEnvironment("xx.xx.xx.xx", 40631,
> "target/FlinkTest-0.0.1-SNAPSHOT-jar-with-dependencies.jar");
>
>
>
> The JAR needs to be compiled first.
>
>
>
> Hope it helps!
>
> Christophe
>
>
>
> 2016-04-06 9:25 GMT+02:00 Serhiy Boychenko :
>
> Cheerz,
>
>
>
> I have been working for the last few months on a comparison of different
> data processing engines and recently came across Apache Flink. After
> reading different academic papers comparing Flink with other data
> processing engines, I would definitely give it a shot. The only issue I am
> currently having is that I am unable to submit Flink jobs directly from
> Eclipse (to a YARN cluster). I am wondering if you have any guidelines on
> how I could do the submission not from the client but from Eclipse
> directly? (I was unable to find anything related, with the exception of
> setting up Eclipse for working on the Flink core.)
>
>
>
> Best regards,
>
> Serhiy.
>
>
>
>
>
>
>


Re: Running Flink jobs directly from Eclipse

2016-04-06 Thread Christophe Salperwyck
For me, it took the local jar and uploaded it to the cluster.

2016-04-06 13:16 GMT+02:00 Shannon Carey :

> Thanks for the info! It is a bit difficult to tell based on the
> documentation whether or not you need to put your jar onto the Flink master
> node and run the flink command from there in order to get a job running.
> The documentation on
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/yarn_setup.html
>  isn't
> very explicit about where you can run the flink command from, and doesn't
> mention that you can run the job programmatically instead of using the CLI.
>
> From: Christophe Salperwyck 
> Date: Wednesday, April 6, 2016 at 1:24 PM
> To: 
> Subject: Re: Running Flink jobs directly from Eclipse
>
> From my side I was starting the YARN session from the cluster:
> flink-0.10.1/bin/yarn-session.sh -n 64 -s 4 -jm 4096 -tm 4096
>
> Then getting the IP/port from the WebUI and then from Eclipse:
> ExecutionEnvironment env =
> ExecutionEnvironment.createRemoteEnvironment("xx.xx.xx.xx", 40631,
> "target/FlinkTest-0.0.1-SNAPSHOT-jar-with-dependencies.jar");
>
> The JAR needs to be compiled first.
>
> Hope it helps!
> Christophe
>
> 2016-04-06 9:25 GMT+02:00 Serhiy Boychenko :
>
>> Cheerz,
>>
>>
>>
>> I have been working for the last few months on a comparison of different
>> data processing engines and recently came across Apache Flink. After
>> reading different academic papers comparing Flink with other data
>> processing engines, I would definitely give it a shot. The only issue I am
>> currently having is that I am unable to submit Flink jobs directly from
>> Eclipse (to a YARN cluster). I am wondering if you have any guidelines on
>> how I could do the submission not from the client but from Eclipse
>> directly? (I was unable to find anything related, with the exception of
>> setting up Eclipse for working on the Flink core.)
>>
>>
>>
>> Best regards,
>>
>> Serhiy.
>>
>>
>>
>
>


Re: Running Flink jobs directly from Eclipse

2016-04-06 Thread Christophe Salperwyck
From my side, I was starting the YARN session from the cluster:
flink-0.10.1/bin/yarn-session.sh -n 64 -s 4 -jm 4096 -tm 4096

Then getting the IP/port from the WebUI and then from Eclipse:
ExecutionEnvironment env =
ExecutionEnvironment.createRemoteEnvironment("xx.xx.xx.xx", 40631,
"target/FlinkTest-0.0.1-SNAPSHOT-jar-with-dependencies.jar");

The JAR needs to be compiled first.

Hope it helps!
Christophe

2016-04-06 9:25 GMT+02:00 Serhiy Boychenko :

> Cheerz,
>
>
>
> I have been working for the last few months on a comparison of different
> data processing engines and recently came across Apache Flink. After
> reading different academic papers comparing Flink with other data
> processing engines, I would definitely give it a shot. The only issue I am
> currently having is that I am unable to submit Flink jobs directly from
> Eclipse (to a YARN cluster). I am wondering if you have any guidelines on
> how I could do the submission not from the client but from Eclipse
> directly? (I was unable to find anything related, with the exception of
> setting up Eclipse for working on the Flink core.)
>
>
>
> Best regards,
>
> Serhiy.
>
>
>


Re: XGBoost4J: Portable Distributed XGboost in Flink

2016-03-15 Thread Christophe Salperwyck
Hi,

The paper compares the performance of your XGBoost with the Spark MLlib
version. It would be nice to see how it scales when using Spark or Flink as
the engine, and also to compare it to your native distributed version (with
Rabit, right?).

If you have some charts, they are welcome :-)

BTW, where did you submit this paper (if not confidential of course)?

Thanks!
Christophe


2016-03-15 0:41 GMT+01:00 Tianqi Chen :

> Hi Flink Community:
> I am sending this email to let you know we just released XGBoost4J,
> which also runs on Flink. In short, XGBoost is a machine learning package
> that is used by more than half of the machine learning challenge winning
> solutions and is already widely used in industry. The distributed version
> scales to billions of examples (10x faster than spark.mllib in the
> experiment) with fewer resources (see http://arxiv.org/abs/1603.02754).
>
> See our blog post for more details:
> http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html
> We would love to have you try it out and help us make it better.
>
> Cheers
>


Fwd: Flink 0.10.1 and HBase

2016-01-25 Thread Christophe Salperwyck
Hi all,

I have an issue with Flink 0.10.1, HBase, and Guava; it seems to be related
to this JIRA:
https://issues.apache.org/jira/browse/FLINK-3158

If I remove the com.google.common.* class files from the jar file, it then
works.

Is there any other way to deal with this problem?

Thanks for your work!