date:20160216

Re: where can get the summary changes between flink-1.0 and flink-0.10

2016-02-16 Thread Chiwan Park

Hi Zhijiang,

We have wiki pages about description of Flink 1.0 relesase [1] [2]. But the 
pages are not updated in realtime. It is possible that there are some changes 
that haven’t been described.

After releasing 1.0 officially, maybe we post an article dealing with the 
changes in 1.0 to the Flink blog [3].

Regards,
Chiwan Park

[1]: https://cwiki.apache.org/confluence/display/FLINK/1.0+Release
[2]: 
https://cwiki.apache.org/confluence/display/FLINK/Maven+artifact+names+suffixed+with+Scala+version
[3]: http://flink.apache.org/blog/

> On Feb 17, 2016, at 3:34 PM, wangzhijiang999  
> wrote:
> 
> Hi,
> Where can get the summary changes between flink-1.0 and flink-0.10,  
> thank you in advance!
> 
>  
> 
> 
> 
> Best Regards,
> 
> Zhijiang Wang

where can get the summary changes between flink-1.0 and flink-0.10

2016-02-16 Thread wangzhijiang999

Hi,    Where can get the summary changes between flink-1.0 and flink-0.10,  
thank you in advance!
 


Best Regards,
Zhijiang Wang

streaming hdfs sub folders

2016-02-16 Thread Martin Neumann

Hi,

I have a streaming machine learning job that usually runs with input from
kafka. To tweak the models I need to run on some old data from HDFS.

Unfortunately the data on HDFS is spread out over several subfolders.
Basically I have a datum with one subfolder for each hour within those are
the actual input files I'm interested in.

Basically what I need is a source that goes through the subfolder in order
and streams the files into the program. I'm using event timestamps so all
files in 00 need to be processed before 01.

Has anyone an idea on how to do this?

cheers Martin

Re: Read once input data?

2016-02-16 Thread Saliya Ekanayake

Thank you. I'll check this

On Tue, Feb 16, 2016 at 4:01 PM, Fabian Hueske  wrote:

> Broadcasted DataSets are stored on the JVM heap of each task manager (but
> shared among multiple slots on the same TM), hence the size restriction.
>
> There are two ways to retrieve a DataSet (such as the result of a reduce).
> 1) if you want to fetch the result into your client program use
> DataSet.collect(). This immediately triggers an execution and fetches the
> result from the cluster.
> 2) if you want to use the result for a computation in the cluster use
> broadcast sets as described above.
>
> 2016-02-16 21:54 GMT+01:00 Saliya Ekanayake :
>
>> Thank you, yes, this makes sense. The broadcasted data in my case would a
>> large array of 3D coordinates,
>>
>> On a side note, how can I take the output from a reduce function? I can
>> see methods to write it to a given output, but is it possible to retrieve
>> the reduced result back to the program - like a double value representing
>> the average in the previous example.
>>
>>
>> On Tue, Feb 16, 2016 at 3:47 PM, Fabian Hueske  wrote:
>>
>>> You can use so-called BroadcastSets to send any sufficiently small
>>> DataSet (such as a computed average) to any other function and use it there.
>>> However, in your case you'll end up with a data flow that branches (at
>>> the source) and merges again (when the average is send to the second map).
>>> Such patterns can cause deadlocks and can therefore not be pipelined
>>> which means that the data before the branch is written to disk and read
>>> again.
>>> In your case it might be even better to read the data twice instead of
>>> reading, writing, and reading it.
>>>
>>> Fabian
>>>
>>> 2016-02-16 21:15 GMT+01:00 Saliya Ekanayake :
>>>
 I looked at the samples and I think what you meant is clear, but I
 didn't find a solution for my need. In my case, I want to use the result
 from first map operation before I can apply the second map on the
 *same* data set. For simplicity, let's say I've a bunch of short
 values represented as my data set. Then I need to find their average, so I
 use a map and reduce. Then I want to map these short values with another
 function, but it needs that average computed in the beginning to work
 correctly.

 Is this possible without doing multiple reads of the input data to
 create the same dataset?

 Thank you,
 saliya

 On Tue, Feb 16, 2016 at 12:03 PM, Fabian Hueske 
 wrote:

> Yes, if you implement both maps in a single job, data is read once.
>
> 2016-02-16 15:53 GMT+01:00 Saliya Ekanayake :
>
>> Fabian,
>>
>> I've a quick follow-up question on what you suggested. When streaming
>> the same data through different maps, were you implying that everything
>> goes as single job in Flink, so data read happens only once?
>>
>> Thanks,
>> Saliya
>>
>> On Mon, Feb 15, 2016 at 3:58 PM, Fabian Hueske 
>> wrote:
>>
>>> It is not possible to "pin" data sets in memory, yet.
>>> However, you can stream the same data set through two different
>>> mappers at the same time.
>>>
>>> For instance you can have a job like:
>>>
>>>  /---> Map 1 --> SInk1
>>> Source --<
>>>  \---> Map 2 --> SInk2
>>>
>>> and execute it at once.
>>> For that you define you data flow and call execute once after all
>>> sinks have been created.
>>>
>>> Best, Fabian
>>>
>>> 2016-02-15 21:32 GMT+01:00 Saliya Ekanayake :
>>>
 Fabian,

 count() was just an example. What I would like to do is say run two
 map operations on the dataset (ds). Each map will have it's own 
 reduction,
 so is there a way to avoid creating two jobs for such scenario?

 The reason is, reading these binary matrices are expensive. In our
 current MPI implementation, I am using memory maps for faster loading 
 and
 reuse.

 Thank you,
 Saliya

 On Mon, Feb 15, 2016 at 3:15 PM, Fabian Hueske 
 wrote:

> Hi,
>
> it looks like you are executing two distinct Flink jobs.
> DataSet.count() triggers the execution of a new job. If you have
> an execute() call in your program, this will lead to two Flink jobs 
> being
> executed.
> It is not possible to share state among these jobs.
>
> Maybe you should add a custom count implementation (using a
> ReduceFunction) which is executed in the same program as the other
> ReduceFunction.
>
> Best, Fabian
>
>
>
> 2016-02-15 21:05 GMT+01:00 Saliya Ekanayake :
>
>> Hi,
>>
>> I see that an InputFormat's open() and nextRecord() methods get
>> called for each terminal operation on

Re: Read once input data?

2016-02-16 Thread Fabian Hueske

Broadcasted DataSets are stored on the JVM heap of each task manager (but
shared among multiple slots on the same TM), hence the size restriction.

There are two ways to retrieve a DataSet (such as the result of a reduce).
1) if you want to fetch the result into your client program use
DataSet.collect(). This immediately triggers an execution and fetches the
result from the cluster.
2) if you want to use the result for a computation in the cluster use
broadcast sets as described above.

2016-02-16 21:54 GMT+01:00 Saliya Ekanayake :

> Thank you, yes, this makes sense. The broadcasted data in my case would a
> large array of 3D coordinates,
>
> On a side note, how can I take the output from a reduce function? I can
> see methods to write it to a given output, but is it possible to retrieve
> the reduced result back to the program - like a double value representing
> the average in the previous example.
>
>
> On Tue, Feb 16, 2016 at 3:47 PM, Fabian Hueske  wrote:
>
>> You can use so-called BroadcastSets to send any sufficiently small
>> DataSet (such as a computed average) to any other function and use it there.
>> However, in your case you'll end up with a data flow that branches (at
>> the source) and merges again (when the average is send to the second map).
>> Such patterns can cause deadlocks and can therefore not be pipelined
>> which means that the data before the branch is written to disk and read
>> again.
>> In your case it might be even better to read the data twice instead of
>> reading, writing, and reading it.
>>
>> Fabian
>>
>> 2016-02-16 21:15 GMT+01:00 Saliya Ekanayake :
>>
>>> I looked at the samples and I think what you meant is clear, but I
>>> didn't find a solution for my need. In my case, I want to use the result
>>> from first map operation before I can apply the second map on the *same* 
>>> data
>>> set. For simplicity, let's say I've a bunch of short values represented as
>>> my data set. Then I need to find their average, so I use a map and reduce.
>>> Then I want to map these short values with another function, but it needs
>>> that average computed in the beginning to work correctly.
>>>
>>> Is this possible without doing multiple reads of the input data to
>>> create the same dataset?
>>>
>>> Thank you,
>>> saliya
>>>
>>> On Tue, Feb 16, 2016 at 12:03 PM, Fabian Hueske 
>>> wrote:
>>>
 Yes, if you implement both maps in a single job, data is read once.

 2016-02-16 15:53 GMT+01:00 Saliya Ekanayake :

> Fabian,
>
> I've a quick follow-up question on what you suggested. When streaming
> the same data through different maps, were you implying that everything
> goes as single job in Flink, so data read happens only once?
>
> Thanks,
> Saliya
>
> On Mon, Feb 15, 2016 at 3:58 PM, Fabian Hueske 
> wrote:
>
>> It is not possible to "pin" data sets in memory, yet.
>> However, you can stream the same data set through two different
>> mappers at the same time.
>>
>> For instance you can have a job like:
>>
>>  /---> Map 1 --> SInk1
>> Source --<
>>  \---> Map 2 --> SInk2
>>
>> and execute it at once.
>> For that you define you data flow and call execute once after all
>> sinks have been created.
>>
>> Best, Fabian
>>
>> 2016-02-15 21:32 GMT+01:00 Saliya Ekanayake :
>>
>>> Fabian,
>>>
>>> count() was just an example. What I would like to do is say run two
>>> map operations on the dataset (ds). Each map will have it's own 
>>> reduction,
>>> so is there a way to avoid creating two jobs for such scenario?
>>>
>>> The reason is, reading these binary matrices are expensive. In our
>>> current MPI implementation, I am using memory maps for faster loading 
>>> and
>>> reuse.
>>>
>>> Thank you,
>>> Saliya
>>>
>>> On Mon, Feb 15, 2016 at 3:15 PM, Fabian Hueske 
>>> wrote:
>>>
 Hi,

 it looks like you are executing two distinct Flink jobs.
 DataSet.count() triggers the execution of a new job. If you have an
 execute() call in your program, this will lead to two Flink jobs being
 executed.
 It is not possible to share state among these jobs.

 Maybe you should add a custom count implementation (using a
 ReduceFunction) which is executed in the same program as the other
 ReduceFunction.

 Best, Fabian



 2016-02-15 21:05 GMT+01:00 Saliya Ekanayake :

> Hi,
>
> I see that an InputFormat's open() and nextRecord() methods get
> called for each terminal operation on a given dataset using that 
> particular
> InputFormat. Is it possible to avoid this - possibly using some 
> caching
> technique in Flink?
>
> For example, I've some code like below a

Re: Read once input data?

2016-02-16 Thread Saliya Ekanayake

Thank you, yes, this makes sense. The broadcasted data in my case would a
large array of 3D coordinates,

On a side note, how can I take the output from a reduce function? I can see
methods to write it to a given output, but is it possible to retrieve the
reduced result back to the program - like a double value representing the
average in the previous example.


On Tue, Feb 16, 2016 at 3:47 PM, Fabian Hueske  wrote:

> You can use so-called BroadcastSets to send any sufficiently small DataSet
> (such as a computed average) to any other function and use it there.
> However, in your case you'll end up with a data flow that branches (at the
> source) and merges again (when the average is send to the second map).
> Such patterns can cause deadlocks and can therefore not be pipelined which
> means that the data before the branch is written to disk and read again.
> In your case it might be even better to read the data twice instead of
> reading, writing, and reading it.
>
> Fabian
>
> 2016-02-16 21:15 GMT+01:00 Saliya Ekanayake :
>
>> I looked at the samples and I think what you meant is clear, but I didn't
>> find a solution for my need. In my case, I want to use the result from
>> first map operation before I can apply the second map on the *same* data
>> set. For simplicity, let's say I've a bunch of short values represented as
>> my data set. Then I need to find their average, so I use a map and reduce.
>> Then I want to map these short values with another function, but it needs
>> that average computed in the beginning to work correctly.
>>
>> Is this possible without doing multiple reads of the input data to create
>> the same dataset?
>>
>> Thank you,
>> saliya
>>
>> On Tue, Feb 16, 2016 at 12:03 PM, Fabian Hueske 
>> wrote:
>>
>>> Yes, if you implement both maps in a single job, data is read once.
>>>
>>> 2016-02-16 15:53 GMT+01:00 Saliya Ekanayake :
>>>
 Fabian,

 I've a quick follow-up question on what you suggested. When streaming
 the same data through different maps, were you implying that everything
 goes as single job in Flink, so data read happens only once?

 Thanks,
 Saliya

 On Mon, Feb 15, 2016 at 3:58 PM, Fabian Hueske 
 wrote:

> It is not possible to "pin" data sets in memory, yet.
> However, you can stream the same data set through two different
> mappers at the same time.
>
> For instance you can have a job like:
>
>  /---> Map 1 --> SInk1
> Source --<
>  \---> Map 2 --> SInk2
>
> and execute it at once.
> For that you define you data flow and call execute once after all
> sinks have been created.
>
> Best, Fabian
>
> 2016-02-15 21:32 GMT+01:00 Saliya Ekanayake :
>
>> Fabian,
>>
>> count() was just an example. What I would like to do is say run two
>> map operations on the dataset (ds). Each map will have it's own 
>> reduction,
>> so is there a way to avoid creating two jobs for such scenario?
>>
>> The reason is, reading these binary matrices are expensive. In our
>> current MPI implementation, I am using memory maps for faster loading and
>> reuse.
>>
>> Thank you,
>> Saliya
>>
>> On Mon, Feb 15, 2016 at 3:15 PM, Fabian Hueske 
>> wrote:
>>
>>> Hi,
>>>
>>> it looks like you are executing two distinct Flink jobs.
>>> DataSet.count() triggers the execution of a new job. If you have an
>>> execute() call in your program, this will lead to two Flink jobs being
>>> executed.
>>> It is not possible to share state among these jobs.
>>>
>>> Maybe you should add a custom count implementation (using a
>>> ReduceFunction) which is executed in the same program as the other
>>> ReduceFunction.
>>>
>>> Best, Fabian
>>>
>>>
>>>
>>> 2016-02-15 21:05 GMT+01:00 Saliya Ekanayake :
>>>
 Hi,

 I see that an InputFormat's open() and nextRecord() methods get
 called for each terminal operation on a given dataset using that 
 particular
 InputFormat. Is it possible to avoid this - possibly using some caching
 technique in Flink?

 For example, I've some code like below and I see for both the last
 two statements (reduce() and count()) the above methods in the input 
 format
 get called. Btw. this is a custom input format I wrote to represent a
 binary matrix stored as Short values.

 ShortMatrixInputFormat smif = new ShortMatrixInputFormat();

 DataSet ds = env.createInput(smif, 
 BasicArrayTypeInfo.SHORT_ARRAY_TYPE_INFO);

 MapOperator op = ds.map(...)

 *op.reduce(...)*

 *op.count(...)*


 Thank you,
 Saliya
 --
 Saliya Ekanayake
 Ph.D. Candidate | Research Assistant
>>>

Re: Read once input data?

2016-02-16 Thread Fabian Hueske

You can use so-called BroadcastSets to send any sufficiently small DataSet
(such as a computed average) to any other function and use it there.
However, in your case you'll end up with a data flow that branches (at the
source) and merges again (when the average is send to the second map).
Such patterns can cause deadlocks and can therefore not be pipelined which
means that the data before the branch is written to disk and read again.
In your case it might be even better to read the data twice instead of
reading, writing, and reading it.

Fabian

2016-02-16 21:15 GMT+01:00 Saliya Ekanayake :

> I looked at the samples and I think what you meant is clear, but I didn't
> find a solution for my need. In my case, I want to use the result from
> first map operation before I can apply the second map on the *same* data
> set. For simplicity, let's say I've a bunch of short values represented as
> my data set. Then I need to find their average, so I use a map and reduce.
> Then I want to map these short values with another function, but it needs
> that average computed in the beginning to work correctly.
>
> Is this possible without doing multiple reads of the input data to create
> the same dataset?
>
> Thank you,
> saliya
>
> On Tue, Feb 16, 2016 at 12:03 PM, Fabian Hueske  wrote:
>
>> Yes, if you implement both maps in a single job, data is read once.
>>
>> 2016-02-16 15:53 GMT+01:00 Saliya Ekanayake :
>>
>>> Fabian,
>>>
>>> I've a quick follow-up question on what you suggested. When streaming
>>> the same data through different maps, were you implying that everything
>>> goes as single job in Flink, so data read happens only once?
>>>
>>> Thanks,
>>> Saliya
>>>
>>> On Mon, Feb 15, 2016 at 3:58 PM, Fabian Hueske 
>>> wrote:
>>>
 It is not possible to "pin" data sets in memory, yet.
 However, you can stream the same data set through two different mappers
 at the same time.

 For instance you can have a job like:

  /---> Map 1 --> SInk1
 Source --<
  \---> Map 2 --> SInk2

 and execute it at once.
 For that you define you data flow and call execute once after all sinks
 have been created.

 Best, Fabian

 2016-02-15 21:32 GMT+01:00 Saliya Ekanayake :

> Fabian,
>
> count() was just an example. What I would like to do is say run two
> map operations on the dataset (ds). Each map will have it's own reduction,
> so is there a way to avoid creating two jobs for such scenario?
>
> The reason is, reading these binary matrices are expensive. In our
> current MPI implementation, I am using memory maps for faster loading and
> reuse.
>
> Thank you,
> Saliya
>
> On Mon, Feb 15, 2016 at 3:15 PM, Fabian Hueske 
> wrote:
>
>> Hi,
>>
>> it looks like you are executing two distinct Flink jobs.
>> DataSet.count() triggers the execution of a new job. If you have an
>> execute() call in your program, this will lead to two Flink jobs being
>> executed.
>> It is not possible to share state among these jobs.
>>
>> Maybe you should add a custom count implementation (using a
>> ReduceFunction) which is executed in the same program as the other
>> ReduceFunction.
>>
>> Best, Fabian
>>
>>
>>
>> 2016-02-15 21:05 GMT+01:00 Saliya Ekanayake :
>>
>>> Hi,
>>>
>>> I see that an InputFormat's open() and nextRecord() methods get
>>> called for each terminal operation on a given dataset using that 
>>> particular
>>> InputFormat. Is it possible to avoid this - possibly using some caching
>>> technique in Flink?
>>>
>>> For example, I've some code like below and I see for both the last
>>> two statements (reduce() and count()) the above methods in the input 
>>> format
>>> get called. Btw. this is a custom input format I wrote to represent a
>>> binary matrix stored as Short values.
>>>
>>> ShortMatrixInputFormat smif = new ShortMatrixInputFormat();
>>>
>>> DataSet ds = env.createInput(smif, 
>>> BasicArrayTypeInfo.SHORT_ARRAY_TYPE_INFO);
>>>
>>> MapOperator op = ds.map(...)
>>>
>>> *op.reduce(...)*
>>>
>>> *op.count(...)*
>>>
>>>
>>> Thank you,
>>> Saliya
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> Cell 812-391-4914
>>> http://saliya.org
>>>
>>
>>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
>


>>>
>>>
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Cen

Re: Read once input data?

2016-02-16 Thread Saliya Ekanayake

I looked at the samples and I think what you meant is clear, but I didn't
find a solution for my need. In my case, I want to use the result from
first map operation before I can apply the second map on the *same* data
set. For simplicity, let's say I've a bunch of short values represented as
my data set. Then I need to find their average, so I use a map and reduce.
Then I want to map these short values with another function, but it needs
that average computed in the beginning to work correctly.

Is this possible without doing multiple reads of the input data to create
the same dataset?

Thank you,
saliya

On Tue, Feb 16, 2016 at 12:03 PM, Fabian Hueske  wrote:

> Yes, if you implement both maps in a single job, data is read once.
>
> 2016-02-16 15:53 GMT+01:00 Saliya Ekanayake :
>
>> Fabian,
>>
>> I've a quick follow-up question on what you suggested. When streaming the
>> same data through different maps, were you implying that everything goes as
>> single job in Flink, so data read happens only once?
>>
>> Thanks,
>> Saliya
>>
>> On Mon, Feb 15, 2016 at 3:58 PM, Fabian Hueske  wrote:
>>
>>> It is not possible to "pin" data sets in memory, yet.
>>> However, you can stream the same data set through two different mappers
>>> at the same time.
>>>
>>> For instance you can have a job like:
>>>
>>>  /---> Map 1 --> SInk1
>>> Source --<
>>>  \---> Map 2 --> SInk2
>>>
>>> and execute it at once.
>>> For that you define you data flow and call execute once after all sinks
>>> have been created.
>>>
>>> Best, Fabian
>>>
>>> 2016-02-15 21:32 GMT+01:00 Saliya Ekanayake :
>>>
 Fabian,

 count() was just an example. What I would like to do is say run two map
 operations on the dataset (ds). Each map will have it's own reduction, so
 is there a way to avoid creating two jobs for such scenario?

 The reason is, reading these binary matrices are expensive. In our
 current MPI implementation, I am using memory maps for faster loading and
 reuse.

 Thank you,
 Saliya

 On Mon, Feb 15, 2016 at 3:15 PM, Fabian Hueske 
 wrote:

> Hi,
>
> it looks like you are executing two distinct Flink jobs.
> DataSet.count() triggers the execution of a new job. If you have an
> execute() call in your program, this will lead to two Flink jobs being
> executed.
> It is not possible to share state among these jobs.
>
> Maybe you should add a custom count implementation (using a
> ReduceFunction) which is executed in the same program as the other
> ReduceFunction.
>
> Best, Fabian
>
>
>
> 2016-02-15 21:05 GMT+01:00 Saliya Ekanayake :
>
>> Hi,
>>
>> I see that an InputFormat's open() and nextRecord() methods get
>> called for each terminal operation on a given dataset using that 
>> particular
>> InputFormat. Is it possible to avoid this - possibly using some caching
>> technique in Flink?
>>
>> For example, I've some code like below and I see for both the last
>> two statements (reduce() and count()) the above methods in the input 
>> format
>> get called. Btw. this is a custom input format I wrote to represent a
>> binary matrix stored as Short values.
>>
>> ShortMatrixInputFormat smif = new ShortMatrixInputFormat();
>>
>> DataSet ds = env.createInput(smif, 
>> BasicArrayTypeInfo.SHORT_ARRAY_TYPE_INFO);
>>
>> MapOperator op = ds.map(...)
>>
>> *op.reduce(...)*
>>
>> *op.count(...)*
>>
>>
>> Thank you,
>> Saliya
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> Cell 812-391-4914
>> http://saliya.org
>>
>
>


 --
 Saliya Ekanayake
 Ph.D. Candidate | Research Assistant
 School of Informatics and Computing | Digital Science Center
 Indiana University, Bloomington
 Cell 812-391-4914
 http://saliya.org

>>>
>>>
>>
>>
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> Cell 812-391-4914
>> http://saliya.org
>>
>
>


-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org

Re: writeAsCSV with partitionBy

2016-02-16 Thread Fabian Hueske

Yes, you're right. I did not understand your question correctly.

Right now, Flink does not feature an output format that writes records to
output files depending on a key attribute.
You would need to implement such an output format yourself and append it as
follows:

val data = ...
data.partitionByHash(0) // partition to send all records with the same key
to the same machine
  .output(new YourOutputFormat())

In case of many distinct keys, you would need to limit the number of open
file handles. The OF will be easier to implement, if you do a
sortPartition(0, Order.ASCENDING) before the output format to sort the data
by key.

Cheers, Fabian




2016-02-16 19:52 GMT+01:00 Srikanth :

> Fabian,
>
> Not sure if we are on the same page. If I do something like below code, it
> will groupby field 0 and each task will write a separate part file in
> parallel.
>
> val sink = data1.join(data2)
> .where(1).equalTo(0) { ((l,r) => ( l._3, r._3) ) }
> .partitionByHash(0)
> .writeAsCsv(pathBase + "output/test", rowDelimiter="\n",
> fieldDelimiter="\t" , WriteMode.OVERWRITE)
>
> This will create folder ./output/test/<1,2,3,4...>
>
> But what I was looking for is Hive style partitionBy that will output with
> folder structure
>
>./output/field0=1/file
>./output/field0=2/file
>./output/field0=3/file
>./output/field0=4/file
>
> Assuming field0 is Int and has unique values 1,2,3&4.
>
> Srikanth
>
>
> On Mon, Feb 15, 2016 at 6:20 AM, Fabian Hueske  wrote:
>
>> Hi Srikanth,
>>
>> DataSet.partitionBy() will partition the data on the declared partition
>> fields.
>> If you append a DataSink with the same parallelism as the partition
>> operator, the data will be written out with the defined partitioning.
>> It should be possible to achieve the behavior you described using
>> DataSet.partitionByHash() or partitionByRange().
>>
>> Best, Fabian
>>
>>
>> 2016-02-12 20:53 GMT+01:00 Srikanth :
>>
>>> Hello,
>>>
>>>
>>>
>>> Is there a Hive(or Spark dataframe) partitionBy equivalent in Flink?
>>>
>>> I'm looking to save output as CSV files partitioned by two columns(date
>>> and hour).
>>>
>>> The partitionBy dataset API is more to partition the data based on a
>>> column for further processing.
>>>
>>>
>>>
>>> I'm thinking there is no direct API to do this. But what will be the
>>> best way of achieving this?
>>>
>>>
>>>
>>> Srikanth
>>>
>>>
>>>
>>
>>
>

Re: writeAsCSV with partitionBy

2016-02-16 Thread Srikanth

Fabian,

Not sure if we are on the same page. If I do something like below code, it
will groupby field 0 and each task will write a separate part file in
parallel.

val sink = data1.join(data2)
.where(1).equalTo(0) { ((l,r) => ( l._3, r._3) ) }
.partitionByHash(0)
.writeAsCsv(pathBase + "output/test", rowDelimiter="\n",
fieldDelimiter="\t" , WriteMode.OVERWRITE)

This will create folder ./output/test/<1,2,3,4...>

But what I was looking for is Hive style partitionBy that will output with
folder structure

   ./output/field0=1/file
   ./output/field0=2/file
   ./output/field0=3/file
   ./output/field0=4/file

Assuming field0 is Int and has unique values 1,2,3&4.

Srikanth


On Mon, Feb 15, 2016 at 6:20 AM, Fabian Hueske  wrote:

> Hi Srikanth,
>
> DataSet.partitionBy() will partition the data on the declared partition
> fields.
> If you append a DataSink with the same parallelism as the partition
> operator, the data will be written out with the defined partitioning.
> It should be possible to achieve the behavior you described using
> DataSet.partitionByHash() or partitionByRange().
>
> Best, Fabian
>
>
> 2016-02-12 20:53 GMT+01:00 Srikanth :
>
>> Hello,
>>
>>
>>
>> Is there a Hive(or Spark dataframe) partitionBy equivalent in Flink?
>>
>> I'm looking to save output as CSV files partitioned by two columns(date
>> and hour).
>>
>> The partitionBy dataset API is more to partition the data based on a
>> column for further processing.
>>
>>
>>
>> I'm thinking there is no direct API to do this. But what will be the best
>> way of achieving this?
>>
>>
>>
>> Srikanth
>>
>>
>>
>
>

Re: Read once input data?

2016-02-16 Thread Fabian Hueske

Yes, if you implement both maps in a single job, data is read once.

2016-02-16 15:53 GMT+01:00 Saliya Ekanayake :

> Fabian,
>
> I've a quick follow-up question on what you suggested. When streaming the
> same data through different maps, were you implying that everything goes as
> single job in Flink, so data read happens only once?
>
> Thanks,
> Saliya
>
> On Mon, Feb 15, 2016 at 3:58 PM, Fabian Hueske  wrote:
>
>> It is not possible to "pin" data sets in memory, yet.
>> However, you can stream the same data set through two different mappers
>> at the same time.
>>
>> For instance you can have a job like:
>>
>>  /---> Map 1 --> SInk1
>> Source --<
>>  \---> Map 2 --> SInk2
>>
>> and execute it at once.
>> For that you define you data flow and call execute once after all sinks
>> have been created.
>>
>> Best, Fabian
>>
>> 2016-02-15 21:32 GMT+01:00 Saliya Ekanayake :
>>
>>> Fabian,
>>>
>>> count() was just an example. What I would like to do is say run two map
>>> operations on the dataset (ds). Each map will have it's own reduction, so
>>> is there a way to avoid creating two jobs for such scenario?
>>>
>>> The reason is, reading these binary matrices are expensive. In our
>>> current MPI implementation, I am using memory maps for faster loading and
>>> reuse.
>>>
>>> Thank you,
>>> Saliya
>>>
>>> On Mon, Feb 15, 2016 at 3:15 PM, Fabian Hueske 
>>> wrote:
>>>
 Hi,

 it looks like you are executing two distinct Flink jobs.
 DataSet.count() triggers the execution of a new job. If you have an
 execute() call in your program, this will lead to two Flink jobs being
 executed.
 It is not possible to share state among these jobs.

 Maybe you should add a custom count implementation (using a
 ReduceFunction) which is executed in the same program as the other
 ReduceFunction.

 Best, Fabian



 2016-02-15 21:05 GMT+01:00 Saliya Ekanayake :

> Hi,
>
> I see that an InputFormat's open() and nextRecord() methods get called
> for each terminal operation on a given dataset using that particular
> InputFormat. Is it possible to avoid this - possibly using some caching
> technique in Flink?
>
> For example, I've some code like below and I see for both the last two
> statements (reduce() and count()) the above methods in the input format 
> get
> called. Btw. this is a custom input format I wrote to represent a binary
> matrix stored as Short values.
>
> ShortMatrixInputFormat smif = new ShortMatrixInputFormat();
>
> DataSet ds = env.createInput(smif, 
> BasicArrayTypeInfo.SHORT_ARRAY_TYPE_INFO);
>
> MapOperator op = ds.map(...)
>
> *op.reduce(...)*
>
> *op.count(...)*
>
>
> Thank you,
> Saliya
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
>


>>>
>>>
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> Cell 812-391-4914
>>> http://saliya.org
>>>
>>
>>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
>

Re: Read once input data?

2016-02-16 Thread Saliya Ekanayake

Fabian,

I've a quick follow-up question on what you suggested. When streaming the
same data through different maps, were you implying that everything goes as
single job in Flink, so data read happens only once?

Thanks,
Saliya

On Mon, Feb 15, 2016 at 3:58 PM, Fabian Hueske  wrote:

> It is not possible to "pin" data sets in memory, yet.
> However, you can stream the same data set through two different mappers at
> the same time.
>
> For instance you can have a job like:
>
>  /---> Map 1 --> SInk1
> Source --<
>  \---> Map 2 --> SInk2
>
> and execute it at once.
> For that you define you data flow and call execute once after all sinks
> have been created.
>
> Best, Fabian
>
> 2016-02-15 21:32 GMT+01:00 Saliya Ekanayake :
>
>> Fabian,
>>
>> count() was just an example. What I would like to do is say run two map
>> operations on the dataset (ds). Each map will have it's own reduction, so
>> is there a way to avoid creating two jobs for such scenario?
>>
>> The reason is, reading these binary matrices are expensive. In our
>> current MPI implementation, I am using memory maps for faster loading and
>> reuse.
>>
>> Thank you,
>> Saliya
>>
>> On Mon, Feb 15, 2016 at 3:15 PM, Fabian Hueske  wrote:
>>
>>> Hi,
>>>
>>> it looks like you are executing two distinct Flink jobs.
>>> DataSet.count() triggers the execution of a new job. If you have an
>>> execute() call in your program, this will lead to two Flink jobs being
>>> executed.
>>> It is not possible to share state among these jobs.
>>>
>>> Maybe you should add a custom count implementation (using a
>>> ReduceFunction) which is executed in the same program as the other
>>> ReduceFunction.
>>>
>>> Best, Fabian
>>>
>>>
>>>
>>> 2016-02-15 21:05 GMT+01:00 Saliya Ekanayake :
>>>
 Hi,

 I see that an InputFormat's open() and nextRecord() methods get called
 for each terminal operation on a given dataset using that particular
 InputFormat. Is it possible to avoid this - possibly using some caching
 technique in Flink?

 For example, I've some code like below and I see for both the last two
 statements (reduce() and count()) the above methods in the input format get
 called. Btw. this is a custom input format I wrote to represent a binary
 matrix stored as Short values.

 ShortMatrixInputFormat smif = new ShortMatrixInputFormat();

 DataSet ds = env.createInput(smif, 
 BasicArrayTypeInfo.SHORT_ARRAY_TYPE_INFO);

 MapOperator op = ds.map(...)

 *op.reduce(...)*

 *op.count(...)*


 Thank you,
 Saliya
 --
 Saliya Ekanayake
 Ph.D. Candidate | Research Assistant
 School of Informatics and Computing | Digital Science Center
 Indiana University, Bloomington
 Cell 812-391-4914
 http://saliya.org

>>>
>>>
>>
>>
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> Cell 812-391-4914
>> http://saliya.org
>>
>
>


-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org

Re: inconsistent results

2016-02-16 Thread Aljoscha Krettek

Hi,
I don’t know about the results but one problem I can identify is this snipped:

groupBy(0).sum(2).max(2)

The max(2) here is a non-parallel operation since it finds the max over all 
elements, not grouped by key. If you want the max to also be per-key you have 
to use

groupBy(0).sum(2).andMax(2)

I hope that helps.

Cheers,
Aljoscha
> On 16 Feb 2016, at 12:30, Lydia Ickler  wrote:
> 
> Hi all,
> 
> I am doing a Power Iteration. When I execute it with one machine everything 
> works great. 
> When I use more than one (in my case 5) then I get different results.
> I attached my code. 
> Within the execution plan I can see that a GroupReduce part is executed only 
> by one worker (see screenshot) - is this causing the issue?
> 
> Best regards, 
> Lydia
> 
> 
>

Re: Flink 1.0.0 Release Candidate 0: Please help testing

2016-02-16 Thread Stephan Ewen

Found one blocker issue during testing:

  - Watermark generators accept negative watermarks (FLINK-3415)

On Mon, Feb 15, 2016 at 8:47 PM, Robert Metzger  wrote:

> Hi,
>
> I've now created a "preview RC" for the upcoming 1.0.0 release.
> There are still some blocking issues and important pull requests to be
> merged but nevertheless I would like to start testing Flink for the release.
>
> In past major releases, we needed to create many release candidates, often
> for fixing just some small issues. I would like to speed up the release
> process by collecting as many issues as possible now with the RC0. Once
> these issues are resolved, we can start voting with the RC1.
>
> We have a Wiki page
> https://cwiki.apache.org/confluence/display/FLINK/Releasing containing
> some common release verification tasks.
>
> Also, production users are encouraged to participate in the release
> verification process.
>
> Here are the preview binaries located:
> http://people.apache.org/~rmetzger/flink-1.0.0-rc0/
> This is the staging repository:
> https://repository.apache.org/content/repositories/orgapacheflink-1062
>
>
> To use the release candidate in an existing pom project, set the Flink
> version to 1.0.0 and the repository URL to
> https://repository.apache.org/content/repositories/orgapacheflink-1062.
> The pom should look like this:
>
> 
>UTF-8
>1.0.0
> 
>
> 
>
>   flink.release-staging
>   Apache Development Snapshot Repository
>   
> https://repository.apache.org/content/repositories/orgapacheflink-1062
>   true
>   false
>
> 
>
>
>
>
> Let me start a list of issues we need to resolve:
> - Ensure the release is build with maven < 3.3 (the RC0 has been build
> with Maven 3.3, so the Guava shading is not done properly.)
> - flink-cep depends on Scala, but the artifact doesn't have a scala
> version suffix.
> - the flink-quickstart-java is referring to scala 2.11 by default.
>
> Please add more issues to the list 
>
>
>
>
>

Availability for the ElasticSearch 2 streaming connector

2016-02-16 Thread Vieru, Mihail

Hi,

in reference to this ticket https://issues.apache.org/jira/browse/FLINK-3115
when do you think that an ElasticSearch 2 streaming connector will become
available? Will it make it for the 1.0 release?

That would be great, as we are planning to use that particular version of
ElasticSearch in the very near future.

Best regards,
Mihail

Re: Regarding Concurrent Modification Exception

2016-02-16 Thread Biplob Biswas

Hi,

No, we don't start a flink job inside another job, although the job
creation was done in a loop, but only when one job is finished the next job
started after cleanup. And we didn't get this exception on my local flink
installation, it appears when i run on the cluster.

Thanks & Regards
Biplob Biswas

On Mon, Feb 15, 2016 at 12:25 PM, Fabian Hueske  wrote:

> Hi,
>
> This stacktrace looks really suspicious.
> It includes classes from the submission client (CLIClient), optimizer
> (JobGraphGenerator), and runtime (KryoSerializer).
>
> Is it possible that you try to start a new Flink job inside another job?
> This would not work.
>
> Best, Fabian
>

Re: Issues testing Flink HA w/ ZooKeeper

2016-02-16 Thread Stephan Ewen

Hi!

As a bit of background: ZooKeeper allows you only to store very small data.
We hence persist only the changing checkpoint metadata in ZooKeeper.

To recover a job, some constant data is also needed: The JobGraph, and the
JarFiles. These cannot go to ZooKeeper, but need to go to a reliable
storage (such as HDFS, S3, or a mounted file system).

Greetings,
Stephan


On Tue, Feb 16, 2016 at 11:30 AM, Stefano Baghino <
stefano.bagh...@radicalbit.io> wrote:

> Ok, simply turning up HDFS on the cluster and using it as the state
> backend fixed the issue. Thank you both for the help!
>
> On Mon, Feb 15, 2016 at 5:45 PM, Stefano Baghino <
> stefano.bagh...@radicalbit.io> wrote:
>
>> You can find the log of the recovering job manager here:
>> https://gist.github.com/stefanobaghino/ae28f00efb6bdd907b42
>>
>> Basically, what Ufuk said happened: the job manager tried to fill in for
>> the lost one but couldn't find the actual data because it looked it up
>> locally whereas due to my configuration it was actually stored on another
>> machine.
>>
>> Thanks for the help, it's really been precious!
>>
>> On Mon, Feb 15, 2016 at 5:24 PM, Maximilian Michels 
>> wrote:
>>
>>> Hi Stefano,
>>>
>>> A correction from my side: You don't need to set the execution retries
>>> for HA because a new JobManager will automatically take over and
>>> resubmit all jobs which were recovered from the storage directory you
>>> set up. The number of execution retries applies only to jobs which are
>>> restarted due to a TaskManager failure.
>>>
>>> It would be great if you could supply some logs.
>>>
>>> Cheers,
>>> Max
>>>
>>>
>>> On Mon, Feb 15, 2016 at 1:45 PM, Maximilian Michels 
>>> wrote:
>>> > Hi Stefano,
>>> >
>>> > That is true. The documentation doesn't mention that. Just wanted to
>>> > point you to the documentation if anything else needs to be
>>> > configured. We will update it.
>>> >
>>> > Instead of setting the number of execution retries on the
>>> > StreamExecutionEnvironment, you may also set
>>> > "execution-retries.default" in the flink-conf.yaml. Let us know if
>>> > that fixes your setup.
>>> >
>>> > Cheers,
>>> > Max
>>> >
>>> > On Mon, Feb 15, 2016 at 1:41 PM, Stefano Baghino
>>> >  wrote:
>>> >> Hi Maximilian,
>>> >>
>>> >> thank you for the reply. I've checked out the documentation before
>>> running
>>> >> my tests (I'm not expert enough to not read the docs ;)) but it
>>> doesn't
>>> >> mention some specific requirement regarding the execution retries,
>>> I'll
>>> >> check it out, thank!
>>> >>
>>> >> On Mon, Feb 15, 2016 at 12:51 PM, Maximilian Michels 
>>> wrote:
>>> >>>
>>> >>> Hi Stefano,
>>> >>>
>>> >>> The Job should stop temporarily but then be resumed by the new
>>> >>> JobManager. Have you increased the number of execution retries?
>>> AFAIK,
>>> >>> it is set to 0 by default. This will not re-run the job, even in HA
>>> >>> mode. You can enable it on the StreamExecutionEnvironment.
>>> >>>
>>> >>> Otherwise, you have probably already found the documentation:
>>> >>>
>>> >>>
>>> https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html#configuration
>>> >>>
>>> >>> Cheers,
>>> >>> Max
>>> >>>
>>> >>> On Mon, Feb 15, 2016 at 12:35 PM, Stefano Baghino
>>> >>>  wrote:
>>> >>> > Hello everyone,
>>> >>> >
>>> >>> > last week I've ran some tests with Apache ZooKeeper to get a grip
>>> on
>>> >>> > Flink
>>> >>> > HA features. My tests went bad so far and I can't sort out the
>>> reason.
>>> >>> >
>>> >>> > My latest tests involved Flink 0.10.2, ran as a standalone cluster
>>> with
>>> >>> > 3
>>> >>> > masters and 4 slaves. The 3 masters are also the ZooKeeper (3.4.6)
>>> >>> > ensemble.
>>> >>> > I've started ZooKeeper on each machine, tested it's availability
>>> and
>>> >>> > then
>>> >>> > started the Flink cluster. Since there's no reliable distributed
>>> >>> > filesystem
>>> >>> > on the cluster, I had to use the local file system as the state
>>> backend.
>>> >>> >
>>> >>> > I then submitted a very simple streaming job that writes the
>>> timestamp
>>> >>> > on a
>>> >>> > text file on the local file system each second and then went on to
>>> kill
>>> >>> > the
>>> >>> > process running the job manager to verify that another job manager
>>> takes
>>> >>> > over. However, the job just stopped. I still have to perform some
>>> checks
>>> >>> > on
>>> >>> > the handover to the new job manager, but before digging deeper I
>>> wanted
>>> >>> > to
>>> >>> > ask if my expectation of having the job going despite the job
>>> manager
>>> >>> > failure is unreasonable.
>>> >>> >
>>> >>> > Thanks in advance.
>>> >>> >
>>> >>> > --
>>> >>> > BR,
>>> >>> > Stefano Baghino
>>> >>> >
>>> >>> > Software Engineer @ Radicalbit
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> BR,
>>> >> Stefano Baghino
>>> >>
>>> >> Software Engineer @ Radicalbit
>>>
>>
>>
>>
>> --
>> BR,
>> Stefano Baghino
>>
>> Software Engineer @ Radicalbit
>>
>
>
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer

Re: Read once input data?

2016-02-16 Thread Flavio Pompermaier

I also have a couple of use cases where the pin data sets in memory feature
would help a lot ;)

On Mon, Feb 15, 2016 at 10:18 PM, Saliya Ekanayake 
wrote:

> Thanks, I'll check this.
>
> Saliya
>
> On Mon, Feb 15, 2016 at 4:08 PM, Fabian Hueske  wrote:
>
>> I would have a look at the example programs in our code base:
>>
>>
>> https://github.com/apache/flink/tree/master/flink-examples/flink-examples-batch/src/main/java/org/apache/flink/examples/java
>>
>> Best, Fabian
>>
>> 2016-02-15 22:03 GMT+01:00 Saliya Ekanayake :
>>
>>> Thank you, Fabian.
>>>
>>> Any chance you might have an example on how to define a data flow with
>>> Flink?
>>>
>>>
>>>
>>> On Mon, Feb 15, 2016 at 3:58 PM, Fabian Hueske 
>>> wrote:
>>>
 It is not possible to "pin" data sets in memory, yet.
 However, you can stream the same data set through two different mappers
 at the same time.

 For instance you can have a job like:

  /---> Map 1 --> SInk1
 Source --<
  \---> Map 2 --> SInk2

 and execute it at once.
 For that you define you data flow and call execute once after all sinks
 have been created.

 Best, Fabian

 2016-02-15 21:32 GMT+01:00 Saliya Ekanayake :

> Fabian,
>
> count() was just an example. What I would like to do is say run two
> map operations on the dataset (ds). Each map will have it's own reduction,
> so is there a way to avoid creating two jobs for such scenario?
>
> The reason is, reading these binary matrices are expensive. In our
> current MPI implementation, I am using memory maps for faster loading and
> reuse.
>
> Thank you,
> Saliya
>
> On Mon, Feb 15, 2016 at 3:15 PM, Fabian Hueske 
> wrote:
>
>> Hi,
>>
>> it looks like you are executing two distinct Flink jobs.
>> DataSet.count() triggers the execution of a new job. If you have an
>> execute() call in your program, this will lead to two Flink jobs being
>> executed.
>> It is not possible to share state among these jobs.
>>
>> Maybe you should add a custom count implementation (using a
>> ReduceFunction) which is executed in the same program as the other
>> ReduceFunction.
>>
>> Best, Fabian
>>
>>
>>
>> 2016-02-15 21:05 GMT+01:00 Saliya Ekanayake :
>>
>>> Hi,
>>>
>>> I see that an InputFormat's open() and nextRecord() methods get
>>> called for each terminal operation on a given dataset using that 
>>> particular
>>> InputFormat. Is it possible to avoid this - possibly using some caching
>>> technique in Flink?
>>>
>>> For example, I've some code like below and I see for both the last
>>> two statements (reduce() and count()) the above methods in the input 
>>> format
>>> get called. Btw. this is a custom input format I wrote to represent a
>>> binary matrix stored as Short values.
>>>
>>> ShortMatrixInputFormat smif = new ShortMatrixInputFormat();
>>>
>>> DataSet ds = env.createInput(smif, 
>>> BasicArrayTypeInfo.SHORT_ARRAY_TYPE_INFO);
>>>
>>> MapOperator op = ds.map(...)
>>>
>>> *op.reduce(...)*
>>>
>>> *op.count(...)*
>>>
>>>
>>> Thank you,
>>> Saliya
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> Cell 812-391-4914
>>> http://saliya.org
>>>
>>
>>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
>


>>>
>>>
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> Cell 812-391-4914
>>> http://saliya.org
>>>
>>
>>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
>

Re: Issues testing Flink HA w/ ZooKeeper

2016-02-16 Thread Stefano Baghino

Ok, simply turning up HDFS on the cluster and using it as the state backend
fixed the issue. Thank you both for the help!

On Mon, Feb 15, 2016 at 5:45 PM, Stefano Baghino <
stefano.bagh...@radicalbit.io> wrote:

> You can find the log of the recovering job manager here:
> https://gist.github.com/stefanobaghino/ae28f00efb6bdd907b42
>
> Basically, what Ufuk said happened: the job manager tried to fill in for
> the lost one but couldn't find the actual data because it looked it up
> locally whereas due to my configuration it was actually stored on another
> machine.
>
> Thanks for the help, it's really been precious!
>
> On Mon, Feb 15, 2016 at 5:24 PM, Maximilian Michels 
> wrote:
>
>> Hi Stefano,
>>
>> A correction from my side: You don't need to set the execution retries
>> for HA because a new JobManager will automatically take over and
>> resubmit all jobs which were recovered from the storage directory you
>> set up. The number of execution retries applies only to jobs which are
>> restarted due to a TaskManager failure.
>>
>> It would be great if you could supply some logs.
>>
>> Cheers,
>> Max
>>
>>
>> On Mon, Feb 15, 2016 at 1:45 PM, Maximilian Michels 
>> wrote:
>> > Hi Stefano,
>> >
>> > That is true. The documentation doesn't mention that. Just wanted to
>> > point you to the documentation if anything else needs to be
>> > configured. We will update it.
>> >
>> > Instead of setting the number of execution retries on the
>> > StreamExecutionEnvironment, you may also set
>> > "execution-retries.default" in the flink-conf.yaml. Let us know if
>> > that fixes your setup.
>> >
>> > Cheers,
>> > Max
>> >
>> > On Mon, Feb 15, 2016 at 1:41 PM, Stefano Baghino
>> >  wrote:
>> >> Hi Maximilian,
>> >>
>> >> thank you for the reply. I've checked out the documentation before
>> running
>> >> my tests (I'm not expert enough to not read the docs ;)) but it doesn't
>> >> mention some specific requirement regarding the execution retries, I'll
>> >> check it out, thank!
>> >>
>> >> On Mon, Feb 15, 2016 at 12:51 PM, Maximilian Michels 
>> wrote:
>> >>>
>> >>> Hi Stefano,
>> >>>
>> >>> The Job should stop temporarily but then be resumed by the new
>> >>> JobManager. Have you increased the number of execution retries? AFAIK,
>> >>> it is set to 0 by default. This will not re-run the job, even in HA
>> >>> mode. You can enable it on the StreamExecutionEnvironment.
>> >>>
>> >>> Otherwise, you have probably already found the documentation:
>> >>>
>> >>>
>> https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html#configuration
>> >>>
>> >>> Cheers,
>> >>> Max
>> >>>
>> >>> On Mon, Feb 15, 2016 at 12:35 PM, Stefano Baghino
>> >>>  wrote:
>> >>> > Hello everyone,
>> >>> >
>> >>> > last week I've ran some tests with Apache ZooKeeper to get a grip on
>> >>> > Flink
>> >>> > HA features. My tests went bad so far and I can't sort out the
>> reason.
>> >>> >
>> >>> > My latest tests involved Flink 0.10.2, ran as a standalone cluster
>> with
>> >>> > 3
>> >>> > masters and 4 slaves. The 3 masters are also the ZooKeeper (3.4.6)
>> >>> > ensemble.
>> >>> > I've started ZooKeeper on each machine, tested it's availability and
>> >>> > then
>> >>> > started the Flink cluster. Since there's no reliable distributed
>> >>> > filesystem
>> >>> > on the cluster, I had to use the local file system as the state
>> backend.
>> >>> >
>> >>> > I then submitted a very simple streaming job that writes the
>> timestamp
>> >>> > on a
>> >>> > text file on the local file system each second and then went on to
>> kill
>> >>> > the
>> >>> > process running the job manager to verify that another job manager
>> takes
>> >>> > over. However, the job just stopped. I still have to perform some
>> checks
>> >>> > on
>> >>> > the handover to the new job manager, but before digging deeper I
>> wanted
>> >>> > to
>> >>> > ask if my expectation of having the job going despite the job
>> manager
>> >>> > failure is unreasonable.
>> >>> >
>> >>> > Thanks in advance.
>> >>> >
>> >>> > --
>> >>> > BR,
>> >>> > Stefano Baghino
>> >>> >
>> >>> > Software Engineer @ Radicalbit
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> BR,
>> >> Stefano Baghino
>> >>
>> >> Software Engineer @ Radicalbit
>>
>
>
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit
>



-- 
BR,
Stefano Baghino

Software Engineer @ Radicalbit

Re: where can get the summary changes between flink-1.0 and flink-0.10

where can get the summary changes between flink-1.0 and flink-0.10

streaming hdfs sub folders

Re: Read once input data?

Re: Read once input data?

Re: Read once input data?

Re: Read once input data?

Re: Read once input data?

Re: writeAsCSV with partitionBy

Re: writeAsCSV with partitionBy

Re: Read once input data?

Re: Read once input data?

Re: inconsistent results

Re: Flink 1.0.0 Release Candidate 0: Please help testing

Availability for the ElasticSearch 2 streaming connector

Re: Regarding Concurrent Modification Exception

Re: Issues testing Flink HA w/ ZooKeeper

Re: Read once input data?

Re: Issues testing Flink HA w/ ZooKeeper

19 matches

Site Navigation

Mail list logo

Footer information