[jira] [Updated] (FLINK-9749) Rework Bucketing Sink
[ https://issues.apache.org/jira/browse/FLINK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated FLINK-9749:

Description:

The BucketingSink has a series of deficits at the moment. Due to the long list of issues, I would suggest adding a new StreamingFileSink with a new and cleaner design.

h3. Encoders, Parquet, ORC
- It only efficiently supports row-wise data formats (Avro, JSON, sequence files).
- Efforts to add (columnar) compression for blocks of data are inefficient, because blocks cannot span checkpoints due to persistence-on-checkpoint.
- The encoders are part of the {{flink-connector-filesystem}} project, rather than in orthogonal format projects. This blows up the dependencies of the {{flink-connector-filesystem}} project. As an example, the rolling file sink has dependencies on Hadoop and Avro, which messes up dependency management.

h3. Use of FileSystems
- The BucketingSink works only on Hadoop's FileSystem abstraction, does not support Flink's own FileSystem abstraction, and cannot work with the packaged S3, maprfs, and swift file systems.
- The sink hence needs Hadoop as a dependency.
- The sink relies on "trying out" whether truncation works, which requires write access to the user's working directory.
- The sink relies on enumerating and counting files, rather than maintaining its own state, making it less efficient.

h3. Correctness and Efficiency on S3
- The BucketingSink relies on strong consistency in the file enumeration, hence may work incorrectly on S3.
- The BucketingSink relies on persisting streams at intermediate points. This is not working properly on S3, hence there may be data loss on S3.

h3. .valid-length companion file
- The valid-length file makes it hard for consumers of the data and should be dropped.

We track this design in a series of sub-issues.

> Rework Bucketing Sink
> ---------------------
>
>                 Key: FLINK-9749
>                 URL: https://issues.apache.org/jira/browse/FLINK-9749
>             Project: Flink
>          Issue Type: New Feature
>          Components: Connectors / FileSystem
>            Reporter: Stephan Ewen
>            Assignee: Kostas Kloudas
>            Priority: Major
[jira] [Created] (FLINK-13797) Add missing format argument
Fokko Driesprong created FLINK-13797: Summary: Add missing format argument Key: FLINK-13797 URL: https://issues.apache.org/jira/browse/FLINK-13797 Project: Flink Issue Type: Task Components: Deployment / Mesos Affects Versions: 1.8.1 Reporter: Fokko Driesprong -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (FLINK-13796) Remove unused variable
Fokko Driesprong created FLINK-13796: Summary: Remove unused variable Key: FLINK-13796 URL: https://issues.apache.org/jira/browse/FLINK-13796 Project: Flink Issue Type: Task Components: Deployment / YARN Affects Versions: 1.8.1 Reporter: Fokko Driesprong -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (FLINK-12338) Update Apache Avro test to use try-with-resources
Fokko Driesprong created FLINK-12338: Summary: Update Apache Avro test to use try-with-resources Key: FLINK-12338 URL: https://issues.apache.org/jira/browse/FLINK-12338 Project: Flink Issue Type: Improvement Affects Versions: 1.8.0 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Update Apache Avro test to use try-with-resources. Right now some resources aren't closed at all. Using try-with-resources also increases the readability of the code. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
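The ticket above is about the pattern rather than a specific API. A minimal sketch of try-with-resources (the class and method names are illustrative, not from the Flink test in question):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class TryWithResources {

    // The stream is closed automatically when the try block exits,
    // whether it returns normally or throws an exception.
    static String readAll(byte[] data) throws IOException {
        try (InputStream in = new ByteArrayInputStream(data)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } // in.close() runs here implicitly; no finally block needed
    }
}
```

Any object whose class implements `AutoCloseable` can be declared in the parentheses, which is why the pattern applies directly to readers and writers in the Avro tests.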
[jira] [Commented] (FLINK-12330) Add integrated Tox for ensuring compatibility with the python2/3 version
[ https://issues.apache.org/jira/browse/FLINK-12330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826702#comment-16826702 ] Fokko Driesprong commented on FLINK-12330: -- My personal preference is using Docker for this level of abstraction. Tox is quite a heavyweight wrapper; it is a hassle to set up, which creates a barrier for newcomers. > Add integrated Tox for ensuring compatibility with the python2/3 version > > > Key: FLINK-12330 > URL: https://issues.apache.org/jira/browse/FLINK-12330 > Project: Flink > Issue Type: Sub-task > Components: API / Python >Affects Versions: 1.9.0 >Reporter: sunjincheng >Assignee: sunjincheng >Priority: Major > > Add integrated Tox for ensuring compatibility with the python2/3 version. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-12330) Add integrated Tox for ensuring compatibility with the python2/3 version
[ https://issues.apache.org/jira/browse/FLINK-12330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825818#comment-16825818 ] Fokko Driesprong commented on FLINK-12330: -- Python 2 reaches end-of-life at the end of this year. Many projects are dropping support. Personally, I'm doubtful whether it is worth the investment (and the added complexity) of adding Tox to the project. > Add integrated Tox for ensuring compatibility with the python2/3 version > > > Key: FLINK-12330 > URL: https://issues.apache.org/jira/browse/FLINK-12330 > Project: Flink > Issue Type: Sub-task > Components: API / Python >Affects Versions: 1.9.0 >Reporter: sunjincheng >Priority: Major > > Add integrated Tox for ensuring compatibility with the python2/3 version. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-12250) Rewrite assembleNewPartPath to let it return a new PartPath
[ https://issues.apache.org/jira/browse/FLINK-12250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated FLINK-12250: - Description: While debugging some code, I've noticed assembleNewPartPath does not really return a new path. Also rewrote the code a bit so the mutable inProgressPart is changed in a single place (was: While debugging some code, I've noticed assembleNewPartPath does not really return a new path. Also rewrote the code a bit so the mutable inProgressPart is only changed in a single function. ) > Rewrite assembleNewPartPath to let it return a new PartPath > --- > > Key: FLINK-12250 > URL: https://issues.apache.org/jira/browse/FLINK-12250 > Project: Flink > Issue Type: Improvement >Affects Versions: 1.8.0 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > > While debugging some code, I've noticed assembleNewPartPath does not really > return a new path. Also rewrote the code a bit so the mutable inProgressPart > is changed in a single place -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12250) Rewrite assembleNewPartPath to let it return a new PartPath
Fokko Driesprong created FLINK-12250: Summary: Rewrite assembleNewPartPath to let it return a new PartPath Key: FLINK-12250 URL: https://issues.apache.org/jira/browse/FLINK-12250 Project: Flink Issue Type: Improvement Affects Versions: 1.8.0 Reporter: Fokko Driesprong Assignee: Fokko Driesprong While debugging some code, I've noticed assembleNewPartPath does not really return a new path. Also rewrote the code a bit so the mutable inProgressPart is only changed in a single function. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (FLINK-12225) Simplify the interface of the PartFileWriter
[ https://issues.apache.org/jira/browse/FLINK-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong reassigned FLINK-12225: Assignee: Fokko Driesprong > Simplify the interface of the PartFileWriter > > > Key: FLINK-12225 > URL: https://issues.apache.org/jira/browse/FLINK-12225 > Project: Flink > Issue Type: Improvement > Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The Path is not being used, so no sense in including it in the interface -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12225) Simplify the interface of the PartFileWriter
Fokko Driesprong created FLINK-12225: Summary: Simplify the interface of the PartFileWriter Key: FLINK-12225 URL: https://issues.apache.org/jira/browse/FLINK-12225 Project: Flink Issue Type: Improvement Reporter: Fokko Driesprong The Path is not being used, so no sense in including it in the interface -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-11992) Update Apache Parquet 1.10.1
[ https://issues.apache.org/jira/browse/FLINK-11992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong closed FLINK-11992. Resolution: Won't Fix > Update Apache Parquet 1.10.1 > > > Key: FLINK-11992 > URL: https://issues.apache.org/jira/browse/FLINK-11992 > Project: Flink > Issue Type: Bug > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11992) Update Apache Parquet 1.10.1
Fokko Driesprong created FLINK-11992: Summary: Update Apache Parquet 1.10.1 Key: FLINK-11992 URL: https://issues.apache.org/jira/browse/FLINK-11992 Project: Flink Issue Type: Bug Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) Reporter: Fokko Driesprong Assignee: Fokko Driesprong -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-11883) Harmonize the version of maven-shade-plugin
[ https://issues.apache.org/jira/browse/FLINK-11883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong closed FLINK-11883. Resolution: Won't Fix > Harmonize the version of maven-shade-plugin > --- > > Key: FLINK-11883 > URL: https://issues.apache.org/jira/browse/FLINK-11883 > Project: Flink > Issue Type: Bug > Components: Build System >Affects Versions: 1.7.2 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11911) KafkaTopicPartition is not a valid POJO
[ https://issues.apache.org/jira/browse/FLINK-11911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated FLINK-11911: - Affects Version/s: (was: 1.7.2) 1.8.0 > KafkaTopicPartition is not a valid POJO > --- > > Key: FLINK-11911 > URL: https://issues.apache.org/jira/browse/FLINK-11911 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka >Affects Versions: 1.8.0 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.8.0 > > > KafkaTopicPartition is not a POJO, and therefore it cannot be serialized > efficiently. This is using the KafkaDeserializationSchema. > When enforcing POJO's: > ``` > java.lang.UnsupportedOperationException: Generic types have been disabled in > the ExecutionConfig and type > org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition is > treated as a generic type. > at > org.apache.flink.api.java.typeutils.GenericTypeInfo.createSerializer(GenericTypeInfo.java:86) > at > org.apache.flink.api.java.typeutils.TupleTypeInfo.createSerializer(TupleTypeInfo.java:107) > at > org.apache.flink.api.java.typeutils.TupleTypeInfo.createSerializer(TupleTypeInfo.java:52) > at > org.apache.flink.api.java.typeutils.ListTypeInfo.createSerializer(ListTypeInfo.java:102) > at > org.apache.flink.api.common.state.StateDescriptor.initializeSerializerUnlessSet(StateDescriptor.java:288) > at > org.apache.flink.runtime.state.DefaultOperatorStateBackend.getListState(DefaultOperatorStateBackend.java:289) > at > org.apache.flink.runtime.state.DefaultOperatorStateBackend.getUnionListState(DefaultOperatorStateBackend.java:219) > at > org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.initializeState(FlinkKafkaConsumerBase.java:856) > at > org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178) > at > 
org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160) > at > org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96) > at > org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:278) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711) > at java.lang.Thread.run(Thread.java:748) > ``` > And in the logs: > ``` > 2019-03-13 16:41:28,217 INFO > org.apache.flink.api.java.typeutils.TypeExtractor - class > org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition > does not contain a setter for field topic > 2019-03-13 16:41:28,221 INFO > org.apache.flink.api.java.typeutils.TypeExtractor - Class class > org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition > cannot be used as a POJO type because not all fields are valid POJO fields, > and must be processed as GenericType. Please read the Flink documentation on > "Data Types & Serialization" for details of the effect on performance. > ``` -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11911) KafkaTopicPartition is not a valid POJO
Fokko Driesprong created FLINK-11911: Summary: KafkaTopicPartition is not a valid POJO Key: FLINK-11911 URL: https://issues.apache.org/jira/browse/FLINK-11911 Project: Flink Issue Type: Bug Components: Connectors / Kafka Affects Versions: 1.7.2 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.8.0

KafkaTopicPartition is not a POJO, and therefore it cannot be serialized efficiently. This happens when using the KafkaDeserializationSchema. When enforcing POJOs:

```
java.lang.UnsupportedOperationException: Generic types have been disabled in the ExecutionConfig and type org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition is treated as a generic type.
	at org.apache.flink.api.java.typeutils.GenericTypeInfo.createSerializer(GenericTypeInfo.java:86)
	at org.apache.flink.api.java.typeutils.TupleTypeInfo.createSerializer(TupleTypeInfo.java:107)
	at org.apache.flink.api.java.typeutils.TupleTypeInfo.createSerializer(TupleTypeInfo.java:52)
	at org.apache.flink.api.java.typeutils.ListTypeInfo.createSerializer(ListTypeInfo.java:102)
	at org.apache.flink.api.common.state.StateDescriptor.initializeSerializerUnlessSet(StateDescriptor.java:288)
	at org.apache.flink.runtime.state.DefaultOperatorStateBackend.getListState(DefaultOperatorStateBackend.java:289)
	at org.apache.flink.runtime.state.DefaultOperatorStateBackend.getUnionListState(DefaultOperatorStateBackend.java:219)
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.initializeState(FlinkKafkaConsumerBase.java:856)
	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:278)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
	at java.lang.Thread.run(Thread.java:748)
```

And in the logs:

```
2019-03-13 16:41:28,217 INFO org.apache.flink.api.java.typeutils.TypeExtractor - class org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition does not contain a setter for field topic
2019-03-13 16:41:28,221 INFO org.apache.flink.api.java.typeutils.TypeExtractor - Class class org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition cannot be used as a POJO type because not all fields are valid POJO fields, and must be processed as GenericType. Please read the Flink documentation on "Data Types & Serialization" for details of the effect on performance.
```

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
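For context: Flink accepts a class as a POJO type only if it has a public no-argument constructor and every field is either public or reachable through a matching getter/setter pair; the log above shows KafkaTopicPartition failing the setter check for the `topic` field. A minimal sketch of a class that would pass the check (the class and field names are illustrative, not the actual KafkaTopicPartition fix):

```java
// Sketch of a class satisfying Flink's POJO requirements; names are
// illustrative. Private fields are fine as long as each has a
// conventional getter and setter.
public class TopicPartitionPojo {
    private String topic;
    private int partition;

    public TopicPartitionPojo() {}  // public no-arg constructor: required

    public String getTopic() { return topic; }
    public void setTopic(String topic) { this.topic = topic; }  // the setter the log reports as missing

    public int getPartition() { return partition; }
    public void setPartition(int partition) { this.partition = partition; }
}
```

Classes that fail these checks fall back to Kryo-based `GenericType` serialization, which is what `enableForceKryo`/generic-type enforcement surfaces in the exception above.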
[jira] [Created] (FLINK-11883) Harmonize the versions of the
Fokko Driesprong created FLINK-11883: Summary: Harmonize the versions of the Key: FLINK-11883 URL: https://issues.apache.org/jira/browse/FLINK-11883 Project: Flink Issue Type: Bug Reporter: Fokko Driesprong -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (FLINK-11883) Harmonize the version of maven-shade-plugin
[ https://issues.apache.org/jira/browse/FLINK-11883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong reassigned FLINK-11883: Assignee: Fokko Driesprong > Harmonize the version of maven-shade-plugin > --- > > Key: FLINK-11883 > URL: https://issues.apache.org/jira/browse/FLINK-11883 > Project: Flink > Issue Type: Bug > Components: Build System >Affects Versions: 1.7.2 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11883) Harmonize the version of maven-shade-plugin
[ https://issues.apache.org/jira/browse/FLINK-11883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated FLINK-11883: - Component/s: Build System > Harmonize the version of maven-shade-plugin > --- > > Key: FLINK-11883 > URL: https://issues.apache.org/jira/browse/FLINK-11883 > Project: Flink > Issue Type: Bug > Components: Build System >Affects Versions: 1.7.2 >Reporter: Fokko Driesprong >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11883) Harmonize the version of maven-shade-plugin
[ https://issues.apache.org/jira/browse/FLINK-11883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated FLINK-11883: - Affects Version/s: 1.7.2 > Harmonize the version of maven-shade-plugin > --- > > Key: FLINK-11883 > URL: https://issues.apache.org/jira/browse/FLINK-11883 > Project: Flink > Issue Type: Bug >Affects Versions: 1.7.2 >Reporter: Fokko Driesprong >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11883) Harmonize the version of maven-shade-plugin
[ https://issues.apache.org/jira/browse/FLINK-11883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated FLINK-11883: - Summary: Harmonize the version of maven-shade-plugin (was: Harmonize the versions of the ) > Harmonize the version of maven-shade-plugin > --- > > Key: FLINK-11883 > URL: https://issues.apache.org/jira/browse/FLINK-11883 > Project: Flink > Issue Type: Bug > Reporter: Fokko Driesprong >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (FLINK-11378) Allow HadoopRecoverableWriter to write to Hadoop compatible Filesystems
[ https://issues.apache.org/jira/browse/FLINK-11378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785587#comment-16785587 ] Fokko Driesprong edited comment on FLINK-11378 at 3/6/19 12:55 PM: --- FLINK-11838 should supersede this ticket. was (Author: fokko): FLINK-11378 should supersede this ticket. > Allow HadoopRecoverableWriter to write to Hadoop compatible Filesystems > --- > > Key: FLINK-11378 > URL: https://issues.apache.org/jira/browse/FLINK-11378 > Project: Flink > Issue Type: Improvement > Components: FileSystems >Reporter: Martijn van de Grift >Assignee: Martijn van de Grift >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > At a client we're using Flink jobs to read data from Kafka and writing to > GCS. In earlier versions, we've used `BucketingFileSink` for this, but we > want to switch to the newer `StreamingFileSink`. > Since we're running Flink on Google's DataProc, we're using the Hadoop > compatible GCS > [connector|https://github.com/GoogleCloudPlatform/bigdata-interop] made by > Google. This currently doesn't work on Flink, because Flink checks for a HDFS > scheme at 'HadoopRecoverableWriter'. > We've successfully ran our jobs by creating a custom Flink Distro which has > the hdfs scheme check removed. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11838) Create RecoverableWriter for GCS
Fokko Driesprong created FLINK-11838: Summary: Create RecoverableWriter for GCS Key: FLINK-11838 URL: https://issues.apache.org/jira/browse/FLINK-11838 Project: Flink Issue Type: Improvement Components: FileSystems Affects Versions: 1.8.0 Reporter: Fokko Driesprong Assignee: Fokko Driesprong GCS supports resumable uploads, which we can use to create a RecoverableWriter similar to the S3 implementation: https://cloud.google.com/storage/docs/json_api/v1/how-tos/resumable-upload After using the Hadoop-compatible interface: https://github.com/apache/flink/pull/7519 we've noticed that the current implementation relies heavily on renaming the files on commit: https://github.com/apache/flink/blob/master/flink-filesystems/flink-hadoop-fs/src/main/java/org/apache/flink/runtime/fs/hdfs/HadoopRecoverableFsDataOutputStream.java#L233-L259 This is suboptimal on an object store such as GCS. Therefore we would like to implement a more GCS-native RecoverableWriter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-11378) Allow HadoopRecoverableWriter to write to Hadoop compatible Filesystems
[ https://issues.apache.org/jira/browse/FLINK-11378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785587#comment-16785587 ] Fokko Driesprong commented on FLINK-11378: -- FLINK-11378 should supersede this ticket. > Allow HadoopRecoverableWriter to write to Hadoop compatible Filesystems > --- > > Key: FLINK-11378 > URL: https://issues.apache.org/jira/browse/FLINK-11378 > Project: Flink > Issue Type: Improvement > Components: FileSystems >Reporter: Martijn van de Grift >Assignee: Martijn van de Grift >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > At a client we're using Flink jobs to read data from Kafka and writing to > GCS. In earlier versions, we've used `BucketingFileSink` for this, but we > want to switch to the newer `StreamingFileSink`. > Since we're running Flink on Google's DataProc, we're using the Hadoop > compatible GCS > [connector|https://github.com/GoogleCloudPlatform/bigdata-interop] made by > Google. This currently doesn't work on Flink, because Flink checks for a HDFS > scheme at 'HadoopRecoverableWriter'. > We've successfully ran our jobs by creating a custom Flink Distro which has > the hdfs scheme check removed. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-11401) Allow compression on ParquetBulkWriter
[ https://issues.apache.org/jira/browse/FLINK-11401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749744#comment-16749744 ] Fokko Driesprong commented on FLINK-11401: -- Thanks for the comment, [~StephanEwen]. The RollOnCheckpoint behavior works very well for our use case, which is just ETL'ing the data from Kafka to a bucket. Since we're using an object store FS backend (GCS), the constant renaming of the files from `.in-progress` to `.pending` to `.avro` is far from optimal, since renaming is very expensive. On HDFS this is a constant-time and atomic operation, in contrast to an object store, where it implies copying the whole file. In the near future, we'll open a PR for the Avro writer, implementing the BulkWriter. Since Avro is still in a container format (we want to include the schema in the header of the file), we still need to write a header before writing the actual rows. Writing this header first would require changing some interfaces. > Allow compression on ParquetBulkWriter > -- > > Key: FLINK-11401 > URL: https://issues.apache.org/jira/browse/FLINK-11401 > Project: Flink > Issue Type: Improvement > Components: Batch Connectors and Input/Output Formats >Affects Versions: 1.7.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Fix For: 1.8.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11401) Allow compression on ParquetBulkWriter
Fokko Driesprong created FLINK-11401: Summary: Allow compression on ParquetBulkWriter Key: FLINK-11401 URL: https://issues.apache.org/jira/browse/FLINK-11401 Project: Flink Issue Type: Improvement Affects Versions: 1.7.1 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.8.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11371) Close the AvroParquetReader after use
Fokko Driesprong created FLINK-11371: Summary: Close the AvroParquetReader after use Key: FLINK-11371 URL: https://issues.apache.org/jira/browse/FLINK-11371 Project: Flink Issue Type: Improvement Components: Formats Affects Versions: 1.7.1 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.8.0 The AvroParquetReader is not being closed -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11347) Optimize the ParquetAvroWriters factory
Fokko Driesprong created FLINK-11347: Summary: Optimize the ParquetAvroWriters factory Key: FLINK-11347 URL: https://issues.apache.org/jira/browse/FLINK-11347 Project: Flink Issue Type: Improvement Components: Formats Affects Versions: 1.7.1 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.8.0 In the ParquetAvroWriters the schema is first serialized to a string, and then back to a Schema, which is quite expensive to do. Therefore it makes sense to pass the schema to the writer directly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11340) Bump commons-configuration from 1.7 to 1.10
Fokko Driesprong created FLINK-11340: Summary: Bump commons-configuration from 1.7 to 1.10 Key: FLINK-11340 URL: https://issues.apache.org/jira/browse/FLINK-11340 Project: Flink Issue Type: Improvement Components: Configuration, Core Reporter: Fokko Driesprong Assignee: Fokko Driesprong Bump commons-configuration from 1.7 to 1.10 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11339) Bump exec-maven-plugin from 1.5.0 to 1.6.0
Fokko Driesprong created FLINK-11339: Summary: Bump exec-maven-plugin from 1.5.0 to 1.6.0 Key: FLINK-11339 URL: https://issues.apache.org/jira/browse/FLINK-11339 Project: Flink Issue Type: Improvement Components: Build System Reporter: Fokko Driesprong Assignee: Fokko Driesprong Bump exec-maven-plugin from 1.5.0 to 1.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11338) Bump maven-enforcer-plugin from 3.0.0-M1 to 3.0.0-M2
Fokko Driesprong created FLINK-11338: Summary: Bump maven-enforcer-plugin from 3.0.0-M1 to 3.0.0-M2 Key: FLINK-11338 URL: https://issues.apache.org/jira/browse/FLINK-11338 Project: Flink Issue Type: Improvement Components: Build System Reporter: Fokko Driesprong Assignee: Fokko Driesprong Bump maven-enforcer-plugin from 3.0.0-M1 to 3.0.0-M2 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11322) Use try-with-resource for FlinkKafkaConsumer010
Fokko Driesprong created FLINK-11322: Summary: Use try-with-resource for FlinkKafkaConsumer010 Key: FLINK-11322 URL: https://issues.apache.org/jira/browse/FLINK-11322 Project: Flink Issue Type: Improvement Components: Kafka Connector Affects Versions: 1.7.1 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.8.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11321) Clarify the NPE on fetching a nonexistent Kafka topic
Fokko Driesprong created FLINK-11321: Summary: Clarify the NPE on fetching a nonexistent Kafka topic Key: FLINK-11321 URL: https://issues.apache.org/jira/browse/FLINK-11321 Project: Flink Issue Type: Improvement Components: Kafka Connector Affects Versions: 1.7.1 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.8.0

The following exception isn't that descriptive:

java.lang.NullPointerException
	at org.apache.flink.streaming.connectors.kafka.internal.Kafka09PartitionDiscoverer.getAllPartitionsForTopics(Kafka09PartitionDiscoverer.java:77)
	at org.apache.flink.streaming.connectors.kafka.internals.AbstractPartitionDiscoverer.discoverPartitions(AbstractPartitionDiscoverer.java:131)
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.open(FlinkKafkaConsumerBase.java:473)
	at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:424)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:290)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
	at java.lang.Thread.run(Thread.java:748)

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
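A minimal sketch of the kind of clarification the ticket asks for, turning a null metadata result into an exception that names the offending topic instead of a bare NPE (the method and types here are hypothetical, not the actual Kafka09PartitionDiscoverer code):

```java
import java.util.List;
import java.util.Map;

public class PartitionLookup {

    // Hypothetical lookup: if the topic is unknown, fail with a message
    // that points at the cause rather than letting a null propagate.
    static List<Integer> partitionsFor(Map<String, List<Integer>> metadata, String topic) {
        List<Integer> partitions = metadata.get(topic);
        if (partitions == null) {
            throw new IllegalStateException(
                    "Could not fetch partitions for topic '" + topic
                            + "'. Does the topic exist?");
        }
        return partitions;
    }
}
```

The cost of the check is negligible, and the error now tells the operator which topic to look at instead of a line number inside the partition discoverer.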
[jira] [Created] (FLINK-11306) Bump derby from 10.10.1.1 to 10.14.2.0
Fokko Driesprong created FLINK-11306: Summary: Bump derby from 10.10.1.1 to 10.14.2.0 Key: FLINK-11306 URL: https://issues.apache.org/jira/browse/FLINK-11306 Project: Flink Issue Type: Bug Components: Core Reporter: Fokko Driesprong Assignee: Fokko Driesprong
[jira] [Created] (FLINK-11265) Invalid reference to AvroSinkWriter in example AvroKeyValueSinkWriter
Fokko Driesprong created FLINK-11265: Summary: Invalid reference to AvroSinkWriter in example AvroKeyValueSinkWriter Key: FLINK-11265 URL: https://issues.apache.org/jira/browse/FLINK-11265 Project: Flink Issue Type: Bug Affects Versions: 1.7.1 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.7.2
[jira] [Created] (FLINK-11262) Bump jython-standalone to 2.7.1
Fokko Driesprong created FLINK-11262: Summary: Bump jython-standalone to 2.7.1 Key: FLINK-11262 URL: https://issues.apache.org/jira/browse/FLINK-11262 Project: Flink Issue Type: Improvement Components: Streaming Reporter: Fokko Driesprong Assignee: Fokko Driesprong Due to a security issue: https://ossindex.sonatype.org/vuln/7a4be7b3-74f5-4a9b-a24f-d1fd80ed6bbca
[jira] [Updated] (FLINK-11260) Bump Janino compiler dependency
[ https://issues.apache.org/jira/browse/FLINK-11260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated FLINK-11260: - Description: Bump the Janino dependency: http://janino-compiler.github.io/janino/changelog.html (was: Bump the Janino depdency: http://janino-compiler.github.io/janino/changelog.html) > Bump Janino compiler dependency > --- > > Key: FLINK-11260 > URL: https://issues.apache.org/jira/browse/FLINK-11260 > Project: Flink > Issue Type: Improvement > Components: Core >Affects Versions: 1.7.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Fix For: 1.7.2 > > Time Spent: 10m > Remaining Estimate: 0h > > Bump the Janino dependency: > http://janino-compiler.github.io/janino/changelog.html
[jira] [Updated] (FLINK-11260) Bump Janino compiler dependency
[ https://issues.apache.org/jira/browse/FLINK-11260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated FLINK-11260: - Description: Bump the Janino depdency: http://janino-compiler.github.io/janino/changelog.html (was: Bump the commons-compiler) > Bump Janino compiler dependency > --- > > Key: FLINK-11260 > URL: https://issues.apache.org/jira/browse/FLINK-11260 > Project: Flink > Issue Type: Improvement > Components: Core >Affects Versions: 1.7.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Fix For: 1.7.2 > > Time Spent: 10m > Remaining Estimate: 0h > > Bump the Janino depdency: > http://janino-compiler.github.io/janino/changelog.html
[jira] [Created] (FLINK-11260) Bump Janino compiler dependency
Fokko Driesprong created FLINK-11260: Summary: Bump Janino compiler dependency Key: FLINK-11260 URL: https://issues.apache.org/jira/browse/FLINK-11260 Project: Flink Issue Type: Improvement Components: Core Affects Versions: 1.7.1 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.7.2 Bump the commons-compiler
[jira] [Created] (FLINK-11259) Bump Zookeeper dependency to 3.4.13
Fokko Driesprong created FLINK-11259: Summary: Bump Zookeeper dependency to 3.4.13 Key: FLINK-11259 URL: https://issues.apache.org/jira/browse/FLINK-11259 Project: Flink Issue Type: Improvement Components: Cluster Management Affects Versions: 1.7.1 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.7.2 Bump Zookeeper to 3.4.13 https://zookeeper.apache.org/doc/r3.4.13/releasenotes.html
[jira] [Created] (FLINK-11258) Add badge to the README
Fokko Driesprong created FLINK-11258: Summary: Add badge to the README Key: FLINK-11258 URL: https://issues.apache.org/jira/browse/FLINK-11258 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Fokko Driesprong Assignee: Fokko Driesprong I think we should add the build badge to the README so we can check whether master is still passing.
[GitHub] flink issue #4072: [FLINK-6848] Update managed state docs
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/4072 Hi @tzulitai, I've added some more Scala examples. If you are still missing something, please let me know. Kind regards, Fokko --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #4072: [FLINK-6848] Update managed state docs
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/4072 Hi @tzulitai, I fully agree. Give me some time to work on the other Scala examples, I need to make sure that they are working properly. I'll wrap it up this week. I'll update the commit and rebase with master. Cheers, Fokko
[GitHub] flink pull request #4072: [FLINK-6848] Update managed state docs
GitHub user Fokko opened a pull request: https://github.com/apache/flink/pull/4072 [FLINK-6848] Update managed state docs Hi guys, I would like to add an example of how to work with managed state in Scala. The code is tested locally and might be a nice addition to the docs. Cheers, Fokko Driesprong Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration. If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the [How To Contribute guide](http://flink.apache.org/how-to-contribute.html). In addition to going through the list, please provide a meaningful description of your changes. - [x] General - The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text") - The pull request addresses only one issue - Each commit in the PR has a meaningful commit message (including the JIRA id) - [x] Documentation - Documentation has been added for new functionality - Old documentation affected by the pull request has been updated - JavaDoc for public methods has been added - [x] Tests & Build - Functionality added by the pull request is covered by tests - `mvn clean verify` has been executed successfully locally or a Travis build has passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/Fokko/flink fd-update-raw-and-managed-state-docs Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/4072.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4072 commit 8103fc28d10d131eb1273dba4b477c25ac278bf0 Author: Fokko Driesprong Date: 2017-06-04T14:08:44Z Update managed state docs Add an example of how to work with managed state in Scala
[jira] [Created] (FLINK-6848) Extend the managed state docs with a Scala example
Fokko Driesprong created FLINK-6848: --- Summary: Extend the managed state docs with a Scala example Key: FLINK-6848 URL: https://issues.apache.org/jira/browse/FLINK-6848 Project: Flink Issue Type: Bug Reporter: Fokko Driesprong Hi all, It would be nice to add a Scala example code snippet in the Managed state docs. This makes it a bit easier to start using managed state in Scala. The code is tested and works. Kind regards, Fokko
[GitHub] flink pull request #3280: [Flink-5724] Error in documentation zipping elemen...
GitHub user Fokko opened a pull request: https://github.com/apache/flink/pull/3280 [Flink-5724] Error in documentation zipping elements Because the Scala tab is defined twice, it is not possible to open the Python tab. Please look at the documentation page itself: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/batch/zip_elements_guide.html I believe it is a copy-paste error (which also happens to me too often ;) Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration. If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the [How To Contribute guide](http://flink.apache.org/how-to-contribute.html). In addition to going through the list, please provide a meaningful description of your changes. - [x] General - The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text") - The pull request addresses only one issue - Each commit in the PR has a meaningful commit message (including the JIRA id) - [x] Documentation - Documentation has been added for new functionality - Old documentation affected by the pull request has been updated - JavaDoc for public methods has been added - [x] Tests & Build - Functionality added by the pull request is covered by tests - `mvn clean verify` has been executed successfully locally or a Travis build has passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/Fokko/flink fd-fix-docs-zipping-elements Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3280.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3280 commit f490fbcd3633f77e87d89ba397c2b1ed1a030543 Author: Fokko Driesprong Date: 2017-02-06T20:08:31Z Error in documentation zipping elements 
Because the Scala tab is defined twice, it is not possible to open the Python tab.
[jira] [Created] (FLINK-5724) Error in the 'Zipping Elements' docs
Fokko Driesprong created FLINK-5724: --- Summary: Error in the 'Zipping Elements' docs Key: FLINK-5724 URL: https://issues.apache.org/jira/browse/FLINK-5724 Project: Flink Issue Type: Bug Reporter: Fokko Driesprong The tab for the Python documentation isn't working because there are two tabs pointing at the Scala example.
[GitHub] flink pull request #:
Github user Fokko commented on the pull request: https://github.com/apache/flink/commit/32e1675aa38eec4a15272d62977dfe3ddbe92401#commitcomment-20632930 Did not know this, thanks.
[GitHub] flink issue #3077: [FLINK-5423] Implement Stochastic Outlier Selection
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3077 @tillrohrmann I've added documentation about the algorithm. Can you check? Cheers, Fokko
[GitHub] flink pull request #3077: [FLINK-5423] Implement Stochastic Outlier Selectio...
Github user Fokko commented on a diff in the pull request: https://github.com/apache/flink/pull/3077#discussion_r96290365 --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/outlier/StochasticOutlierSelection.scala --- @@ -0,0 +1,367 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.outlier + +/** An implementation of the Stochastic Outlier Selection algorithm by Jeroen Jansen + * + * For more information about SOS, see https://github.com/jeroenjanssens/sos + * J.H.M. Janssens, F. Huszar, E.O. Postma, and H.J. van den Herik. Stochastic + * Outlier Selection. Technical Report TiCC TR 2012-001, Tilburg University, + * Tilburg, the Netherlands, 2012. + * + * @example + * {{{ + * val inputDS = env.fromCollection(List( + * LabeledVector(0.0, DenseVector(1.0, 1.0)), + * LabeledVector(1.0, DenseVector(2.0, 1.0)), + * LabeledVector(2.0, DenseVector(1.0, 2.0)), + * LabeledVector(3.0, DenseVector(2.0, 2.0)), + * LabeledVector(4.0, DenseVector(5.0, 8.0)) // The outlier! 
+ * )) + * + * val sos = StochasticOutlierSelection() + * .setPerplexity(3) + * + * val outputDS = sos.transform(inputDS) + * + * val expectedOutputDS = Array( + *0.2790094479202896, + *0.25775014551682535, + *0.22136130977995766, + *0.12707053787018444, + *0.9922779902453757 // The outlier! + * ) + * + * assert(outputDS == expectedOutputDS) + * }}} + * + * =Parameters= + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.Perplexity]]: + * Perplexity can be interpreted as the k in k-nearest neighbor algorithms. The difference is that + * in SOS being a neighbor is not a binary property, but a probabilistic one. Should be between + * 1 and n-1, where n is the number of observations. + * (Default value: '''30''') + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.ErrorTolerance]]: + * The accepted error tolerance. When increasing this number, it will sacrifice accuracy in + * return for reduced computational time. + * (Default value: '''1e-20''') + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.MaxIterations]]: + * The maximum number of iterations to perform. 
(Default value: '''5000''') + */ + +import breeze.linalg.functions.euclideanDistance +import breeze.linalg.{sum, DenseVector => BreezeDenseVector, Vector => BreezeVector} +import org.apache.flink.api.common.operators.Order +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.api.scala.utils._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap, WithParameters} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{TransformDataSetOperation, Transformer} + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +class StochasticOutlierSelection extends Transformer[StochasticOutlierSelection] { + + import StochasticOutlierSelection._ + + + /** Sets the perplexity of the outlier selection algorithm, can be seen as the k of kNN +* For more information, please read the Stochastic Outlier Selection algorithm paper +* +* @param perplexity the perplexity of the affinity fit +* @return +*/ + def setPerplexity(perplexity: Double): StochasticOutlierSelection = { +requir
[GitHub] flink pull request #3077: [FLINK-5423] Implement Stochastic Outlier Selectio...
Github user Fokko commented on a diff in the pull request: https://github.com/apache/flink/pull/3077#discussion_r96290218 --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/outlier/StochasticOutlierSelection.scala --- @@ -0,0 +1,367 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.outlier + +/** An implementation of the Stochastic Outlier Selection algorithm by Jeroen Jansen + * + * For more information about SOS, see https://github.com/jeroenjanssens/sos + * J.H.M. Janssens, F. Huszar, E.O. Postma, and H.J. van den Herik. Stochastic + * Outlier Selection. Technical Report TiCC TR 2012-001, Tilburg University, + * Tilburg, the Netherlands, 2012. + * + * @example + * {{{ + * val inputDS = env.fromCollection(List( + * LabeledVector(0.0, DenseVector(1.0, 1.0)), + * LabeledVector(1.0, DenseVector(2.0, 1.0)), + * LabeledVector(2.0, DenseVector(1.0, 2.0)), + * LabeledVector(3.0, DenseVector(2.0, 2.0)), + * LabeledVector(4.0, DenseVector(5.0, 8.0)) // The outlier! 
+ * )) + * + * val sos = StochasticOutlierSelection() + * .setPerplexity(3) + * + * val outputDS = sos.transform(inputDS) + * + * val expectedOutputDS = Array( + *0.2790094479202896, + *0.25775014551682535, + *0.22136130977995766, + *0.12707053787018444, + *0.9922779902453757 // The outlier! + * ) + * + * assert(outputDS == expectedOutputDS) + * }}} + * + * =Parameters= + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.Perplexity]]: + * Perplexity can be interpreted as the k in k-nearest neighbor algorithms. The difference is that + * in SOS being a neighbor is not a binary property, but a probabilistic one. Should be between + * 1 and n-1, where n is the number of observations. + * (Default value: '''30''') + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.ErrorTolerance]]: + * The accepted error tolerance. When increasing this number, it will sacrifice accuracy in + * return for reduced computational time. + * (Default value: '''1e-20''') + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.MaxIterations]]: + * The maximum number of iterations to perform. 
(Default value: '''5000''') + */ + +import breeze.linalg.functions.euclideanDistance +import breeze.linalg.{sum, DenseVector => BreezeDenseVector, Vector => BreezeVector} +import org.apache.flink.api.common.operators.Order +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.api.scala.utils._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap, WithParameters} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{TransformDataSetOperation, Transformer} + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +class StochasticOutlierSelection extends Transformer[StochasticOutlierSelection] { + + import StochasticOutlierSelection._ + + + /** Sets the perplexity of the outlier selection algorithm, can be seen as the k of kNN +* For more information, please read the Stochastic Outlier Selection algorithm paper +* +* @param perplexity the perplexity of the affinity fit +* @return +*/ + def setPerplexity(perplexity: Double): StochasticOutlierSelection = { +requir
[GitHub] flink pull request #3077: [FLINK-5423] Implement Stochastic Outlier Selectio...
Github user Fokko commented on a diff in the pull request: https://github.com/apache/flink/pull/3077#discussion_r96290195 --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/outlier/StochasticOutlierSelection.scala --- @@ -0,0 +1,367 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.outlier + +/** An implementation of the Stochastic Outlier Selection algorithm by Jeroen Jansen + * + * For more information about SOS, see https://github.com/jeroenjanssens/sos + * J.H.M. Janssens, F. Huszar, E.O. Postma, and H.J. van den Herik. Stochastic + * Outlier Selection. Technical Report TiCC TR 2012-001, Tilburg University, + * Tilburg, the Netherlands, 2012. + * + * @example + * {{{ + * val inputDS = env.fromCollection(List( + * LabeledVector(0.0, DenseVector(1.0, 1.0)), + * LabeledVector(1.0, DenseVector(2.0, 1.0)), + * LabeledVector(2.0, DenseVector(1.0, 2.0)), + * LabeledVector(3.0, DenseVector(2.0, 2.0)), + * LabeledVector(4.0, DenseVector(5.0, 8.0)) // The outlier! 
+ * )) + * + * val sos = StochasticOutlierSelection() + * .setPerplexity(3) + * + * val outputDS = sos.transform(inputDS) + * + * val expectedOutputDS = Array( + *0.2790094479202896, + *0.25775014551682535, + *0.22136130977995766, + *0.12707053787018444, + *0.9922779902453757 // The outlier! + * ) + * + * assert(outputDS == expectedOutputDS) + * }}} + * + * =Parameters= + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.Perplexity]]: + * Perplexity can be interpreted as the k in k-nearest neighbor algorithms. The difference is that + * in SOS being a neighbor is not a binary property, but a probabilistic one. Should be between + * 1 and n-1, where n is the number of observations. + * (Default value: '''30''') + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.ErrorTolerance]]: + * The accepted error tolerance. When increasing this number, it will sacrifice accuracy in + * return for reduced computational time. + * (Default value: '''1e-20''') + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.MaxIterations]]: + * The maximum number of iterations to perform. 
(Default value: '''5000''') + */ + +import breeze.linalg.functions.euclideanDistance +import breeze.linalg.{sum, DenseVector => BreezeDenseVector, Vector => BreezeVector} +import org.apache.flink.api.common.operators.Order +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.api.scala.utils._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap, WithParameters} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{TransformDataSetOperation, Transformer} + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +class StochasticOutlierSelection extends Transformer[StochasticOutlierSelection] { + + import StochasticOutlierSelection._ + + + /** Sets the perplexity of the outlier selection algorithm, can be seen as the k of kNN +* For more information, please read the Stochastic Outlier Selection algorithm paper +* +* @param perplexity the perplexity of the affinity fit +* @return +*/ + def setPerplexity(perplexity: Double): StochasticOutlierSelection = { +requir
[GitHub] flink pull request #3077: [FLINK-5423] Implement Stochastic Outlier Selectio...
Github user Fokko commented on a diff in the pull request: https://github.com/apache/flink/pull/3077#discussion_r96289143 --- Diff: flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/outlier/StochasticOutlierSelectionITSuite.scala --- @@ -0,0 +1,240 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.outlier + +import breeze.linalg.{sum, DenseVector => BreezeDenseVector} +import org.apache.flink.api.scala._ +import org.apache.flink.ml.common.LabeledVector +import org.apache.flink.ml.math.DenseVector +import org.apache.flink.ml.outlier.StochasticOutlierSelection.BreezeLabeledVector +import org.apache.flink.ml.util.FlinkTestBase +import org.scalatest.{FlatSpec, Matchers} + +class StochasticOutlierSelectionITSuite extends FlatSpec with Matchers with FlinkTestBase { + behavior of "Stochastic Outlier Selection algorithm" + val EPSILON = 1e-16 + + /* +Unit-tests created based on the Python scripts of the algorithms author' +https://github.com/jeroenjanssens/scikit-sos + +For more information about SOS, see https://github.com/jeroenjanssens/sos +J.H.M. Janssens, F. Huszar, E.O. Postma, and H.J. van den Herik. Stochastic +Outlier Selection. 
Technical Report TiCC TR 2012-001, Tilburg University, +Tilburg, the Netherlands, 2012. + */ + + val perplexity = 3 + val errorTolerance = 0 + val maxIterations = 5000 + val parameters = new StochasticOutlierSelection().setPerplexity(perplexity).parameters + + val env = ExecutionEnvironment.getExecutionEnvironment + + it should "Compute the perplexity of the vector and return the correct error" in { +val vector = BreezeDenseVector(Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0)) + +val output = Array( + 0.39682901665799636, + 0.15747326846175236, + 0.06248996227359784, + 0.024797830280027126, + 0.009840498605275054, + 0.0039049953849556816, + 6.149323865970302E-4, + 2.4402301428445443E-4, + 9.683541280042027E-5 +) + +val search = StochasticOutlierSelection.binarySearch( + vector, + Math.log(perplexity), + maxIterations, + errorTolerance +).toArray + +search should be(output) + } + + it should "Compute the distance matrix and give symmetrical distances" in { + +val data = env.fromCollection(List( + BreezeLabeledVector(0, BreezeDenseVector(Array(1.0, 3.0))), + BreezeLabeledVector(1, BreezeDenseVector(Array(5.0, 1.0))) +)) + +val distanceMatrix = StochasticOutlierSelection + .computeDissimilarityVectors(data) + .map(_.data) + .collect() + .toArray + +print(distanceMatrix) --- End diff -- Oops, still in there from the debugging.
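The `binarySearch` exercised by the test above finds the beta for which the entropy of the affinity distribution over a dissimilarity vector matches log(perplexity), as described in the SOS paper. The following is a minimal, hypothetical Java sketch of that search; the names are illustrative and this is not Flink's actual `StochasticOutlierSelection.binarySearch`:

```java
public class PerplexitySketch {
    // Entropy of the affinity distribution p_j = exp(-beta*d_j) / sum_k exp(-beta*d_k),
    // computed in closed form: H = log(sumA) + beta * sum(d_j*a_j) / sumA.
    static double entropy(double[] d, double beta) {
        double sumA = 0.0, sumDA = 0.0;
        for (double dj : d) {
            double a = Math.exp(-beta * dj);
            sumA += a;
            sumDA += dj * a;
        }
        return Math.log(sumA) + beta * sumDA / sumA;
    }

    // Binary search for beta such that entropy(d, beta) == log(perplexity).
    // Entropy decreases monotonically in beta, so bisection converges.
    static double searchBeta(double[] d, double perplexity, int maxIterations, double tolerance) {
        double target = Math.log(perplexity);
        double beta = 1.0, lo = 0.0, hi = Double.POSITIVE_INFINITY;
        for (int i = 0; i < maxIterations; i++) {
            double h = entropy(d, beta);
            if (Math.abs(h - target) <= tolerance) break;
            if (h > target) {       // distribution too flat: sharpen it
                lo = beta;
                beta = Double.isInfinite(hi) ? beta * 2 : (lo + hi) / 2;
            } else {                // distribution too peaked: flatten it
                hi = beta;
                beta = (lo + hi) / 2;
            }
        }
        return beta;
    }

    public static void main(String[] args) {
        // Same dissimilarity vector and perplexity as the unit test above.
        double[] d = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0};
        double beta = searchBeta(d, 3.0, 5000, 1e-10);
        System.out.println("entropy matches log(3): "
            + (Math.abs(entropy(d, beta) - Math.log(3.0)) < 1e-6));
    }
}
```

The perplexity then plays the role of a soft k: larger perplexity spreads affinity over more neighbors, smaller perplexity concentrates it on the nearest ones.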
[GitHub] flink issue #3077: [FLINK-5423] Implement Stochastic Outlier Selection
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3077 Thanks @tillrohrmann, excellent idea regarding the documentation. I'll also process the code comments, good feedback. Somewhere today or tomorrow I will fix this. Cheers, Fokko
[GitHub] flink issue #3081: [FLINK-5426] Clean up the Flink Machine Learning library
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3081 I've rebased both branches with master.
[GitHub] flink issue #3081: [FLINK-5426] Clean up the Flink Machine Learning library
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3081 Hi @tillrohrmann, did you find any time to check #3077 and this PR?
[GitHub] flink issue #3081: [FLINK-5426] Clean up the Flink Machine Learning library
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3081 I think we have some flaky tests, since the build passes on my own Travis: https://travis-ci.org/Fokko/flink/builds/189854940
[GitHub] flink issue #3077: [FLINK-5423] Implement Stochastic Outlier Selection
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3077 I think we have some flaky tests, since the build passes on my own Travis: https://travis-ci.org/Fokko/flink/builds/189855914
[GitHub] flink issue #3081: [FLINK-5426] Clean up the Flink Machine Learning library
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3081 I think we have some flaky tests, since the build passes on my own Travis: https://travis-ci.org/Fokko/flink/builds/189855914
[GitHub] flink pull request #3081: [FLINK-5426] Clean up the Flink Machine Learning l...
GitHub user Fokko opened a pull request: https://github.com/apache/flink/pull/3081 [FLINK-5426] Clean up the Flink Machine Learning library Hi guys, I would like to contribute to the Flink ML library. I took the liberty of cleaning up some of the code and improving the scaladoc. Besides that, I've implemented #3077 to get more familiar with the Flink API, and I would love to contribute more in the future, in particular to the machine learning library. If you have any questions, please let me know. Let me know if improvements to the ML library are appreciated in general. - [x] General - The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text") - The pull request addresses only one issue - Each commit in the PR has a meaningful commit message (including the JIRA id) - [x] Documentation - Documentation has been added for new functionality - Old documentation affected by the pull request has been updated - JavaDoc for public methods has been added - [x] Tests & Build - Functionality added by the pull request is covered by tests - `mvn clean verify` has been executed successfully locally or a Travis build has passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/Fokko/flink fd-cleanup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3081.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3081 commit 013b22d7bcaf48c8e96983295fcc455faf0aa94b Author: Fokko Driesprong Date: 2017-01-06T20:34:53Z Removed duplicate tests, improved scaladoc and naming, removed typos in scaladoc, introduced and improved use of constants, improved test-case naming.
[jira] [Created] (FLINK-5426) Clean up the Flink Machine Learning library
Fokko Driesprong created FLINK-5426: --- Summary: Clean up the Flink Machine Learning library Key: FLINK-5426 URL: https://issues.apache.org/jira/browse/FLINK-5426 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Fokko Driesprong Hi guys, I would like to clean up the Machine Learning library. A lot of the code in the ML library does not conform to the original contribution guide. For example: Duplicate tests with different names but exactly the same test case: https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/math/DenseVectorSuite.scala#L148 https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/math/DenseVectorSuite.scala#L164 Lots of multi-line test cases: https://github.com/Fokko/flink/blob/master/flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/math/DenseVectorSuite.scala Misuse of constants: https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/math/DenseMatrix.scala#L58 Please allow me to clean this up; I'm looking forward to contributing more code, especially to the ML part. I have been a contributor to Apache Spark and am happy to extend the codebase with new distributed algorithms and make it more mature. Cheers, Fokko -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request #3077: [FLINK-5423] Implement Stochastic Outlier Selectio...
GitHub user Fokko opened a pull request: https://github.com/apache/flink/pull/3077 [FLINK-5423] Implement Stochastic Outlier Selection Implemented the Stochastic Outlier Selection algorithm in the Machine Learning library, including the test code. Added documentation. Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration. If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the [How To Contribute guide](http://flink.apache.org/how-to-contribute.html). In addition to going through the list, please provide a meaningful description of your changes. - [ ] General - The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text") - The pull request addresses only one issue - Each commit in the PR has a meaningful commit message (including the JIRA id) - [ ] Documentation - Documentation has been added for new functionality - Old documentation affected by the pull request has been updated - JavaDoc for public methods has been added - [ ] Tests & Build - Functionality added by the pull request is covered by tests - `mvn clean verify` has been executed successfully locally or a Travis build has passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/Fokko/flink fd-implement-stochastic-outlier-selection Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3077.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3077 commit a67b170f6eee6c053322a4730f1b8dcaa680a112 Author: Fokko Driesprong Date: 2016-12-30T06:38:52Z Implemented the Stochastic Outlier Selection algorithm in the Machine Learning library, including the test code. Added documentation.
[jira] [Created] (FLINK-5423) Implement Stochastic Outlier Selection
Fokko Driesprong created FLINK-5423: --- Summary: Implement Stochastic Outlier Selection Key: FLINK-5423 URL: https://issues.apache.org/jira/browse/FLINK-5423 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Fokko Driesprong I've implemented the Stochastic Outlier Selection (SOS) algorithm by Jeroen Janssens. http://jeroenjanssens.com/2013/11/24/stochastic-outlier-selection.html It is integrated as much as possible with the components from the machine learning library. The algorithm has been compared to four other outlier-detection algorithms, and the comparison shows that SOS achieves higher performance on most of the tested real-world datasets.
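For readers unfamiliar with SOS, here is a hedged sketch of the idea, not the Flink implementation: pairwise dissimilarities are turned into affinities, affinities are normalized into binding probabilities, and a point's outlier probability is the chance that no other point "binds" to it. The real algorithm tunes a per-point variance via a perplexity binary search; this sketch assumes one fixed variance for brevity, and all names are hypothetical.

```python
import math

def sos_outlier_probabilities(points, variance=1.0):
    """Simplified SOS: fixed variance instead of a perplexity search."""
    n = len(points)
    # Pairwise squared Euclidean dissimilarities.
    d = [[sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
          for j in range(n)] for i in range(n)]
    # Affinities decay exponentially with dissimilarity;
    # a point has no affinity with itself.
    a = [[0.0 if i == j else math.exp(-d[i][j] / (2.0 * variance))
          for j in range(n)] for i in range(n)]
    # Binding probabilities: each row of affinities normalized to sum to 1.
    b = [[a[i][j] / sum(a[i]) for j in range(n)] for i in range(n)]
    # Outlier probability of j: probability that no other point binds to j.
    return [math.prod(1.0 - b[i][j] for i in range(n) if i != j)
            for j in range(n)]

# Three clustered points and one isolated point: the isolated point
# receives the highest outlier probability.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
probs = sos_outlier_probabilities(points)
```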
[GitHub] flink pull request #3048: Clarified the import path of the Breeze DenseVecto...
Github user Fokko closed the pull request at: https://github.com/apache/flink/pull/3048
[GitHub] flink issue #2442: [FLINK-4148] incorrect calculation minDist distance in Qu...
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/2442 Looks good, please merge. Should have been fixed long ago :-)
[GitHub] flink issue #3052: Swap the pattern matching order
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3052 Alright, I've just rebased with master. Looks like Travis is working again, good job! Cheers!
[GitHub] flink pull request #3052: Swap the pattern matching order
Github user Fokko closed the pull request at: https://github.com/apache/flink/pull/3052
[GitHub] flink pull request #3052: Swap the pattern matching order
GitHub user Fokko opened a pull request: https://github.com/apache/flink/pull/3052 Swap the pattern matching order Swap the pattern matching order, because `EuclideanDistanceMetric extends SquaredEuclideanDistanceMetric extends DistanceMetric`; otherwise the `EuclideanDistanceMetric` branch can never match:

```
[WARNING] /Users/fokkodriesprong/Desktop/flink-fokko/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala:106: warning: unreachable code
[WARNING] case _: EuclideanDistanceMetric => math.sqrt(minDist)
[WARNING] ^
[WARNING] warning: Class org.apache.log4j.Level not found - continuing with a stub.
[WARNING] warning: there were 1 feature warning(s); re-run with -feature for details
[WARNING] three warnings found
```

You can merge this pull request into a Git repository by running: $ git pull https://github.com/Fokko/flink fd-fix-pattern-matching Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3052.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3052 commit 29ebffc77cfbe917796f44764936972b578ebd38 Author: Fokko Driesprong Date: 2016-12-29T22:49:14Z Swap the pattern matching order, because EuclideanDistanceMetric extends SquaredEuclideanDistanceMetric extends DistanceMetric, otherwise the EuclideanDistance cannot be executed.
[GitHub] flink issue #3048: Clarified the import path of the Breeze DenseVector
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3048 Locally the tests pass. Looking at the error logs, the failure doesn't appear to be related to the changes in this PR, for example:

```
java.io.FileNotFoundException: build-target/lib/flink-dist-*.jar (No such file or directory)
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.(ZipFile.java:220)
at java.util.zip.ZipFile.(ZipFile.java:150)
at java.util.zip.ZipFile.(ZipFile.java:121)
at sun.tools.jar.Main.list(Main.java:1060)
at sun.tools.jar.Main.run(Main.java:291)
at sun.tools.jar.Main.main(Main.java:1233)
find: `./flink-yarn-tests/target/flink-yarn-tests*': No such file or directory
```
[GitHub] flink pull request #3048: Clarified the import path of the Breeze DenseVecto...
GitHub user Fokko opened a pull request: https://github.com/apache/flink/pull/3048 Clarified the import path of the Breeze DenseVector Guys, I'm working on an extension of the ML library in Flink, and I stumbled upon this. Since it is such a trivial fix, I didn't create a JIRA ticket. Keep up the good work! Cheers, You can merge this pull request into a Git repository by running: $ git pull https://github.com/Fokko/flink fd-cleanup-package-structure Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3048.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3048 commit 3fd38fe9785d607a05d045cd54a05af9ed48e350 Author: Fokko Driesprong Date: 2016-12-27T14:43:15Z Replaced the full import path with the BreezeDenseVector itself to make it more readable