[jira] [Updated] (FLINK-9749) Rework Bucketing Sink
[ https://issues.apache.org/jira/browse/FLINK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated FLINK-9749:

Description:

The BucketingSink has a series of deficits at the moment. Due to the long list of issues, I would suggest adding a new StreamingFileSink with a new and cleaner design.

h3. Encoders, Parquet, ORC
- It only efficiently supports row-wise data formats (Avro, JSON, sequence files).
- Efforts to add (columnar) compression for blocks of data are inefficient, because blocks cannot span checkpoints due to persistence-on-checkpoint.
- The encoders are part of the {{flink-connector-filesystem}} project, rather than in orthogonal format projects. This blows up the dependencies of the {{flink-connector-filesystem}} project. As an example, the rolling file sink has dependencies on Hadoop and Avro, which messes up dependency management.

h3. Use of FileSystems
- The BucketingSink works only on Hadoop's FileSystem abstraction, does not support Flink's own FileSystem abstraction, and cannot work with the packaged S3, maprfs, and swift file systems.
- The sink hence needs Hadoop as a dependency.
- The sink relies on "trying out" whether truncation works, which requires write access to the user's working directory.
- The sink relies on enumerating and counting files, rather than maintaining its own state, making it less efficient.

h3. Correctness and Efficiency on S3
- The BucketingSink relies on strong consistency in the file enumeration, hence may work incorrectly on S3.
- The BucketingSink relies on persisting streams at intermediate points. This is not working properly on S3, hence there may be data loss on S3.

h3. .valid-length companion file
- The valid-length file makes it hard for consumers of the data and should be dropped.

We track this design in a series of sub-issues.

> Rework Bucketing Sink
> ---------------------
>
>                 Key: FLINK-9749
>                 URL: https://issues.apache.org/jira/browse/FLINK-9749
>             Project: Flink
>          Issue Type: New Feature
>          Components: Connectors / FileSystem
>            Reporter: Stephan Ewen
>            Assignee: Kostas Kloudas
>            Priority: Major
[jira] [Created] (FLINK-13797) Add missing format argument
Fokko Driesprong created FLINK-13797: Summary: Add missing format argument Key: FLINK-13797 URL: https://issues.apache.org/jira/browse/FLINK-13797 Project: Flink Issue Type: Task Components: Deployment / Mesos Affects Versions: 1.8.1 Reporter: Fokko Driesprong -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (FLINK-13796) Remove unused variable
Fokko Driesprong created FLINK-13796: Summary: Remove unused variable Key: FLINK-13796 URL: https://issues.apache.org/jira/browse/FLINK-13796 Project: Flink Issue Type: Task Components: Deployment / YARN Affects Versions: 1.8.1 Reporter: Fokko Driesprong -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (FLINK-12338) Update Apache Avro test to use try-with-resources
Fokko Driesprong created FLINK-12338: Summary: Update Apache Avro test to use try-with-resources Key: FLINK-12338 URL: https://issues.apache.org/jira/browse/FLINK-12338 Project: Flink Issue Type: Improvement Affects Versions: 1.8.0 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Update Apache Avro test to use try-with-resources. Right now some resources aren't closed at all. Using try-with-resources also increases the readability of the code. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
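The ticket above is about the pattern rather than a specific API. A minimal sketch of try-with-resources (the class and method names are illustrative, not from the Flink test in question):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class TryWithResources {

    // The stream is closed automatically when the try block exits,
    // whether it returns normally or throws an exception.
    static String readAll(byte[] data) throws IOException {
        try (InputStream in = new ByteArrayInputStream(data)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } // in.close() runs here implicitly; no finally block needed
    }
}
```

Any object whose class implements `AutoCloseable` can be declared in the parentheses, which is why the pattern applies directly to readers and writers in the Avro tests.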
[jira] [Commented] (FLINK-12330) Add integrated Tox for ensuring compatibility with the python2/3 version
[ https://issues.apache.org/jira/browse/FLINK-12330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826702#comment-16826702 ] Fokko Driesprong commented on FLINK-12330: -- My personal preference is using Docker for this level of abstraction. Tox is quite a heavyweight wrapper; it is a hassle to set up, which creates a barrier for newcomers. > Add integrated Tox for ensuring compatibility with the python2/3 version > > > Key: FLINK-12330 > URL: https://issues.apache.org/jira/browse/FLINK-12330 > Project: Flink > Issue Type: Sub-task > Components: API / Python >Affects Versions: 1.9.0 >Reporter: sunjincheng >Assignee: sunjincheng >Priority: Major > > Add integrated Tox for ensuring compatibility with the python2/3 version. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-12330) Add integrated Tox for ensuring compatibility with the python2/3 version
[ https://issues.apache.org/jira/browse/FLINK-12330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825818#comment-16825818 ] Fokko Driesprong commented on FLINK-12330: -- Python 2 reaches end-of-life at the end of this year. Many projects are dropping support. Personally, I'm doubtful whether it is worth the investment (and the added complexity) of adding Tox to the project. > Add integrated Tox for ensuring compatibility with the python2/3 version > > > Key: FLINK-12330 > URL: https://issues.apache.org/jira/browse/FLINK-12330 > Project: Flink > Issue Type: Sub-task > Components: API / Python >Affects Versions: 1.9.0 >Reporter: sunjincheng >Priority: Major > > Add integrated Tox for ensuring compatibility with the python2/3 version. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-12250) Rewrite assembleNewPartPath to let it return a new PartPath
[ https://issues.apache.org/jira/browse/FLINK-12250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated FLINK-12250: - Description: While debugging some code, I've noticed assembleNewPartPath does not really return a new path. Also rewrote the code a bit so the mutable inProgressPart is changed in a single place (was: While debugging some code, I've noticed assembleNewPartPath does not really return a new path. Also rewrote the code a bit so the mutable inProgressPart is only changed in a single function. ) > Rewrite assembleNewPartPath to let it return a new PartPath > --- > > Key: FLINK-12250 > URL: https://issues.apache.org/jira/browse/FLINK-12250 > Project: Flink > Issue Type: Improvement >Affects Versions: 1.8.0 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > > While debugging some code, I've noticed assembleNewPartPath does not really > return a new path. Also rewrote the code a bit so the mutable inProgressPart > is changed in a single place -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12250) Rewrite assembleNewPartPath to let it return a new PartPath
Fokko Driesprong created FLINK-12250: Summary: Rewrite assembleNewPartPath to let it return a new PartPath Key: FLINK-12250 URL: https://issues.apache.org/jira/browse/FLINK-12250 Project: Flink Issue Type: Improvement Affects Versions: 1.8.0 Reporter: Fokko Driesprong Assignee: Fokko Driesprong While debugging some code, I've noticed assembleNewPartPath does not really return a new path. Also rewrote the code a bit so the mutable inProgressPart is only changed in a single function. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (FLINK-12225) Simplify the interface of the PartFileWriter
[ https://issues.apache.org/jira/browse/FLINK-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong reassigned FLINK-12225: Assignee: Fokko Driesprong > Simplify the interface of the PartFileWriter > > > Key: FLINK-12225 > URL: https://issues.apache.org/jira/browse/FLINK-12225 > Project: Flink > Issue Type: Improvement > Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The Path is not being used, so no sense in including it in the interface -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12225) Simplify the interface of the PartFileWriter
Fokko Driesprong created FLINK-12225: Summary: Simplify the interface of the PartFileWriter Key: FLINK-12225 URL: https://issues.apache.org/jira/browse/FLINK-12225 Project: Flink Issue Type: Improvement Reporter: Fokko Driesprong The Path is not being used, so no sense in including it in the interface -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-11992) Update Apache Parquet 1.10.1
[ https://issues.apache.org/jira/browse/FLINK-11992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong closed FLINK-11992. Resolution: Won't Fix > Update Apache Parquet 1.10.1 > > > Key: FLINK-11992 > URL: https://issues.apache.org/jira/browse/FLINK-11992 > Project: Flink > Issue Type: Bug > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11992) Update Apache Parquet 1.10.1
Fokko Driesprong created FLINK-11992: Summary: Update Apache Parquet 1.10.1 Key: FLINK-11992 URL: https://issues.apache.org/jira/browse/FLINK-11992 Project: Flink Issue Type: Bug Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) Reporter: Fokko Driesprong Assignee: Fokko Driesprong -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-11883) Harmonize the version of maven-shade-plugin
[ https://issues.apache.org/jira/browse/FLINK-11883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong closed FLINK-11883. Resolution: Won't Fix > Harmonize the version of maven-shade-plugin > --- > > Key: FLINK-11883 > URL: https://issues.apache.org/jira/browse/FLINK-11883 > Project: Flink > Issue Type: Bug > Components: Build System >Affects Versions: 1.7.2 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11911) KafkaTopicPartition is not a valid POJO
[ https://issues.apache.org/jira/browse/FLINK-11911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated FLINK-11911: - Affects Version/s: (was: 1.7.2) 1.8.0 > KafkaTopicPartition is not a valid POJO > --- > > Key: FLINK-11911 > URL: https://issues.apache.org/jira/browse/FLINK-11911 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka >Affects Versions: 1.8.0 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.8.0 > > > KafkaTopicPartition is not a POJO, and therefore it cannot be serialized > efficiently. This is using the KafkaDeserializationSchema. > When enforcing POJO's: > ``` > java.lang.UnsupportedOperationException: Generic types have been disabled in > the ExecutionConfig and type > org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition is > treated as a generic type. > at > org.apache.flink.api.java.typeutils.GenericTypeInfo.createSerializer(GenericTypeInfo.java:86) > at > org.apache.flink.api.java.typeutils.TupleTypeInfo.createSerializer(TupleTypeInfo.java:107) > at > org.apache.flink.api.java.typeutils.TupleTypeInfo.createSerializer(TupleTypeInfo.java:52) > at > org.apache.flink.api.java.typeutils.ListTypeInfo.createSerializer(ListTypeInfo.java:102) > at > org.apache.flink.api.common.state.StateDescriptor.initializeSerializerUnlessSet(StateDescriptor.java:288) > at > org.apache.flink.runtime.state.DefaultOperatorStateBackend.getListState(DefaultOperatorStateBackend.java:289) > at > org.apache.flink.runtime.state.DefaultOperatorStateBackend.getUnionListState(DefaultOperatorStateBackend.java:219) > at > org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.initializeState(FlinkKafkaConsumerBase.java:856) > at > org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178) > at > 
org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160) > at > org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96) > at > org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:278) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711) > at java.lang.Thread.run(Thread.java:748) > ``` > And in the logs: > ``` > 2019-03-13 16:41:28,217 INFO > org.apache.flink.api.java.typeutils.TypeExtractor - class > org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition > does not contain a setter for field topic > 2019-03-13 16:41:28,221 INFO > org.apache.flink.api.java.typeutils.TypeExtractor - Class class > org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition > cannot be used as a POJO type because not all fields are valid POJO fields, > and must be processed as GenericType. Please read the Flink documentation on > "Data Types & Serialization" for details of the effect on performance. > ``` -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11911) KafkaTopicPartition is not a valid POJO
Fokko Driesprong created FLINK-11911: Summary: KafkaTopicPartition is not a valid POJO Key: FLINK-11911 URL: https://issues.apache.org/jira/browse/FLINK-11911 Project: Flink Issue Type: Bug Components: Connectors / Kafka Affects Versions: 1.7.2 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.8.0

KafkaTopicPartition is not a POJO, and therefore it cannot be serialized efficiently. This happens when using the KafkaDeserializationSchema. When enforcing POJOs:

```
java.lang.UnsupportedOperationException: Generic types have been disabled in the ExecutionConfig and type org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition is treated as a generic type.
	at org.apache.flink.api.java.typeutils.GenericTypeInfo.createSerializer(GenericTypeInfo.java:86)
	at org.apache.flink.api.java.typeutils.TupleTypeInfo.createSerializer(TupleTypeInfo.java:107)
	at org.apache.flink.api.java.typeutils.TupleTypeInfo.createSerializer(TupleTypeInfo.java:52)
	at org.apache.flink.api.java.typeutils.ListTypeInfo.createSerializer(ListTypeInfo.java:102)
	at org.apache.flink.api.common.state.StateDescriptor.initializeSerializerUnlessSet(StateDescriptor.java:288)
	at org.apache.flink.runtime.state.DefaultOperatorStateBackend.getListState(DefaultOperatorStateBackend.java:289)
	at org.apache.flink.runtime.state.DefaultOperatorStateBackend.getUnionListState(DefaultOperatorStateBackend.java:219)
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.initializeState(FlinkKafkaConsumerBase.java:856)
	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:278)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
	at java.lang.Thread.run(Thread.java:748)
```

And in the logs:

```
2019-03-13 16:41:28,217 INFO org.apache.flink.api.java.typeutils.TypeExtractor - class org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition does not contain a setter for field topic
2019-03-13 16:41:28,221 INFO org.apache.flink.api.java.typeutils.TypeExtractor - Class class org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition cannot be used as a POJO type because not all fields are valid POJO fields, and must be processed as GenericType. Please read the Flink documentation on "Data Types & Serialization" for details of the effect on performance.
```

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
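For context: Flink accepts a class as a POJO type only if it has a public no-argument constructor and every field is either public or reachable through a matching getter/setter pair; the log above shows KafkaTopicPartition failing the setter check for the `topic` field. A minimal sketch of a class that would pass the check (the class and field names are illustrative, not the actual KafkaTopicPartition fix):

```java
// Sketch of a class satisfying Flink's POJO requirements; names are
// illustrative. Private fields are fine as long as each has a
// conventional getter and setter.
public class TopicPartitionPojo {
    private String topic;
    private int partition;

    public TopicPartitionPojo() {}  // public no-arg constructor: required

    public String getTopic() { return topic; }
    public void setTopic(String topic) { this.topic = topic; }  // the setter the log reports as missing

    public int getPartition() { return partition; }
    public void setPartition(int partition) { this.partition = partition; }
}
```

Classes that fail these checks fall back to Kryo-based `GenericType` serialization, which is what `enableForceKryo`/generic-type enforcement surfaces in the exception above.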
[jira] [Created] (FLINK-11883) Harmonize the versions of the
Fokko Driesprong created FLINK-11883: Summary: Harmonize the versions of the Key: FLINK-11883 URL: https://issues.apache.org/jira/browse/FLINK-11883 Project: Flink Issue Type: Bug Reporter: Fokko Driesprong -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (FLINK-11883) Harmonize the version of maven-shade-plugin
[ https://issues.apache.org/jira/browse/FLINK-11883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong reassigned FLINK-11883: Assignee: Fokko Driesprong > Harmonize the version of maven-shade-plugin > --- > > Key: FLINK-11883 > URL: https://issues.apache.org/jira/browse/FLINK-11883 > Project: Flink > Issue Type: Bug > Components: Build System >Affects Versions: 1.7.2 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11883) Harmonize the version of maven-shade-plugin
[ https://issues.apache.org/jira/browse/FLINK-11883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated FLINK-11883: - Component/s: Build System > Harmonize the version of maven-shade-plugin > --- > > Key: FLINK-11883 > URL: https://issues.apache.org/jira/browse/FLINK-11883 > Project: Flink > Issue Type: Bug > Components: Build System >Affects Versions: 1.7.2 >Reporter: Fokko Driesprong >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11883) Harmonize the version of maven-shade-plugin
[ https://issues.apache.org/jira/browse/FLINK-11883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated FLINK-11883: - Affects Version/s: 1.7.2 > Harmonize the version of maven-shade-plugin > --- > > Key: FLINK-11883 > URL: https://issues.apache.org/jira/browse/FLINK-11883 > Project: Flink > Issue Type: Bug >Affects Versions: 1.7.2 >Reporter: Fokko Driesprong >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-11883) Harmonize the version of maven-shade-plugin
[ https://issues.apache.org/jira/browse/FLINK-11883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated FLINK-11883: - Summary: Harmonize the version of maven-shade-plugin (was: Harmonize the versions of the ) > Harmonize the version of maven-shade-plugin > --- > > Key: FLINK-11883 > URL: https://issues.apache.org/jira/browse/FLINK-11883 > Project: Flink > Issue Type: Bug > Reporter: Fokko Driesprong >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (FLINK-11378) Allow HadoopRecoverableWriter to write to Hadoop compatible Filesystems
[ https://issues.apache.org/jira/browse/FLINK-11378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785587#comment-16785587 ] Fokko Driesprong edited comment on FLINK-11378 at 3/6/19 12:55 PM: --- FLINK-11838 should supersede this ticket. was (Author: fokko): FLINK-11378 should supersede this ticket. > Allow HadoopRecoverableWriter to write to Hadoop compatible Filesystems > --- > > Key: FLINK-11378 > URL: https://issues.apache.org/jira/browse/FLINK-11378 > Project: Flink > Issue Type: Improvement > Components: FileSystems >Reporter: Martijn van de Grift >Assignee: Martijn van de Grift >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > At a client we're using Flink jobs to read data from Kafka and writing to > GCS. In earlier versions, we've used `BucketingFileSink` for this, but we > want to switch to the newer `StreamingFileSink`. > Since we're running Flink on Google's DataProc, we're using the Hadoop > compatible GCS > [connector|https://github.com/GoogleCloudPlatform/bigdata-interop] made by > Google. This currently doesn't work on Flink, because Flink checks for a HDFS > scheme at 'HadoopRecoverableWriter'. > We've successfully ran our jobs by creating a custom Flink Distro which has > the hdfs scheme check removed. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11838) Create RecoverableWriter for GCS
Fokko Driesprong created FLINK-11838: Summary: Create RecoverableWriter for GCS Key: FLINK-11838 URL: https://issues.apache.org/jira/browse/FLINK-11838 Project: Flink Issue Type: Improvement Components: FileSystems Affects Versions: 1.8.0 Reporter: Fokko Driesprong Assignee: Fokko Driesprong GCS supports resumable uploads, which we can use to create a RecoverableWriter similar to the S3 implementation: https://cloud.google.com/storage/docs/json_api/v1/how-tos/resumable-upload After using the Hadoop-compatible interface: https://github.com/apache/flink/pull/7519 we've noticed that the current implementation relies heavily on renaming the files on commit: https://github.com/apache/flink/blob/master/flink-filesystems/flink-hadoop-fs/src/main/java/org/apache/flink/runtime/fs/hdfs/HadoopRecoverableFsDataOutputStream.java#L233-L259 This is suboptimal on an object store such as GCS. Therefore we would like to implement a more GCS-native RecoverableWriter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-11378) Allow HadoopRecoverableWriter to write to Hadoop compatible Filesystems
[ https://issues.apache.org/jira/browse/FLINK-11378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785587#comment-16785587 ] Fokko Driesprong commented on FLINK-11378: -- FLINK-11378 should supersede this ticket. > Allow HadoopRecoverableWriter to write to Hadoop compatible Filesystems > --- > > Key: FLINK-11378 > URL: https://issues.apache.org/jira/browse/FLINK-11378 > Project: Flink > Issue Type: Improvement > Components: FileSystems >Reporter: Martijn van de Grift >Assignee: Martijn van de Grift >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > At a client we're using Flink jobs to read data from Kafka and writing to > GCS. In earlier versions, we've used `BucketingFileSink` for this, but we > want to switch to the newer `StreamingFileSink`. > Since we're running Flink on Google's DataProc, we're using the Hadoop > compatible GCS > [connector|https://github.com/GoogleCloudPlatform/bigdata-interop] made by > Google. This currently doesn't work on Flink, because Flink checks for a HDFS > scheme at 'HadoopRecoverableWriter'. > We've successfully ran our jobs by creating a custom Flink Distro which has > the hdfs scheme check removed. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-11401) Allow compression on ParquetBulkWriter
[ https://issues.apache.org/jira/browse/FLINK-11401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749744#comment-16749744 ] Fokko Driesprong commented on FLINK-11401: -- Thanks for the comment, [~StephanEwen]. The RollOnCheckpoint behavior works very well for our use case, which is just ETL'ing the data from Kafka to a bucket. Since we're using an object store FS backend (GCS), the constant renaming of the files from `.in-progress` to `.pending` to `.avro` is far from optimal, since renaming is very expensive. On HDFS this is a constant-time and atomic operation, in contrast to an object store, where it implies copying the whole file. In the near future, we'll open a PR for the Avro writer, implementing the BulkWriter. Since Avro is still in a container format (we want to include the schema in the header of the file), we still need to write a header before writing the actual rows. Writing this header first would require changing some interfaces. > Allow compression on ParquetBulkWriter > -- > > Key: FLINK-11401 > URL: https://issues.apache.org/jira/browse/FLINK-11401 > Project: Flink > Issue Type: Improvement > Components: Batch Connectors and Input/Output Formats >Affects Versions: 1.7.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Fix For: 1.8.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11401) Allow compression on ParquetBulkWriter
Fokko Driesprong created FLINK-11401: Summary: Allow compression on ParquetBulkWriter Key: FLINK-11401 URL: https://issues.apache.org/jira/browse/FLINK-11401 Project: Flink Issue Type: Improvement Affects Versions: 1.7.1 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.8.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11371) Close the AvroParquetReader after use
Fokko Driesprong created FLINK-11371: Summary: Close the AvroParquetReader after use Key: FLINK-11371 URL: https://issues.apache.org/jira/browse/FLINK-11371 Project: Flink Issue Type: Improvement Components: Formats Affects Versions: 1.7.1 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.8.0 The AvroParquetReader is not being closed -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11347) Optimize the ParquetAvroWriters factory
Fokko Driesprong created FLINK-11347: Summary: Optimize the ParquetAvroWriters factory Key: FLINK-11347 URL: https://issues.apache.org/jira/browse/FLINK-11347 Project: Flink Issue Type: Improvement Components: Formats Affects Versions: 1.7.1 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.8.0 In the ParquetAvroWriters the schema is first serialized to a string, and then back to a Schema, which is quite expensive to do. Therefore it makes sense to pass the schema to the writer directly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11340) Bump commons-configuration from 1.7 to 1.10
Fokko Driesprong created FLINK-11340: Summary: Bump commons-configuration from 1.7 to 1.10 Key: FLINK-11340 URL: https://issues.apache.org/jira/browse/FLINK-11340 Project: Flink Issue Type: Improvement Components: Configuration, Core Reporter: Fokko Driesprong Assignee: Fokko Driesprong Bump commons-configuration from 1.7 to 1.10 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11339) Bump exec-maven-plugin from 1.5.0 to 1.6.0
Fokko Driesprong created FLINK-11339: Summary: Bump exec-maven-plugin from 1.5.0 to 1.6.0 Key: FLINK-11339 URL: https://issues.apache.org/jira/browse/FLINK-11339 Project: Flink Issue Type: Improvement Components: Build System Reporter: Fokko Driesprong Assignee: Fokko Driesprong Bump exec-maven-plugin from 1.5.0 to 1.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11338) Bump maven-enforcer-plugin from 3.0.0-M1 to 3.0.0-M2
Fokko Driesprong created FLINK-11338: Summary: Bump maven-enforcer-plugin from 3.0.0-M1 to 3.0.0-M2 Key: FLINK-11338 URL: https://issues.apache.org/jira/browse/FLINK-11338 Project: Flink Issue Type: Improvement Components: Build System Reporter: Fokko Driesprong Assignee: Fokko Driesprong Bump maven-enforcer-plugin from 3.0.0-M1 to 3.0.0-M2 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11322) Use try-with-resource for FlinkKafkaConsumer010
Fokko Driesprong created FLINK-11322: Summary: Use try-with-resource for FlinkKafkaConsumer010 Key: FLINK-11322 URL: https://issues.apache.org/jira/browse/FLINK-11322 Project: Flink Issue Type: Improvement Components: Kafka Connector Affects Versions: 1.7.1 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.8.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11321) Clarify the NPE on fetching a nonexistent Kafka topic
Fokko Driesprong created FLINK-11321: Summary: Clarify the NPE on fetching a nonexistent Kafka topic Key: FLINK-11321 URL: https://issues.apache.org/jira/browse/FLINK-11321 Project: Flink Issue Type: Improvement Components: Kafka Connector Affects Versions: 1.7.1 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.8.0

The following exception isn't that descriptive:

java.lang.NullPointerException
	at org.apache.flink.streaming.connectors.kafka.internal.Kafka09PartitionDiscoverer.getAllPartitionsForTopics(Kafka09PartitionDiscoverer.java:77)
	at org.apache.flink.streaming.connectors.kafka.internals.AbstractPartitionDiscoverer.discoverPartitions(AbstractPartitionDiscoverer.java:131)
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.open(FlinkKafkaConsumerBase.java:473)
	at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:424)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:290)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
	at java.lang.Thread.run(Thread.java:748)

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
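A minimal sketch of the kind of clarification the ticket asks for, turning a null metadata result into an exception that names the offending topic instead of a bare NPE (the method and types here are hypothetical, not the actual Kafka09PartitionDiscoverer code):

```java
import java.util.List;
import java.util.Map;

public class PartitionLookup {

    // Hypothetical lookup: if the topic is unknown, fail with a message
    // that points at the cause rather than letting a null propagate.
    static List<Integer> partitionsFor(Map<String, List<Integer>> metadata, String topic) {
        List<Integer> partitions = metadata.get(topic);
        if (partitions == null) {
            throw new IllegalStateException(
                    "Could not fetch partitions for topic '" + topic
                            + "'. Does the topic exist?");
        }
        return partitions;
    }
}
```

The cost of the check is negligible, and the error now tells the operator which topic to look at instead of a line number inside the partition discoverer.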
[jira] [Created] (FLINK-11306) Bump derby from 10.10.1.1 to 10.14.2.0
Fokko Driesprong created FLINK-11306: Summary: Bump derby from 10.10.1.1 to 10.14.2.0 Key: FLINK-11306 URL: https://issues.apache.org/jira/browse/FLINK-11306 Project: Flink Issue Type: Bug Components: Core Reporter: Fokko Driesprong Assignee: Fokko Driesprong
[jira] [Created] (FLINK-11265) Invalid reference to AvroSinkWriter in example AvroKeyValueSinkWriter
Fokko Driesprong created FLINK-11265: Summary: Invalid reference to AvroSinkWriter in example AvroKeyValueSinkWriter Key: FLINK-11265 URL: https://issues.apache.org/jira/browse/FLINK-11265 Project: Flink Issue Type: Bug Affects Versions: 1.7.1 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.7.2
[jira] [Created] (FLINK-11262) Bump jython-standalone to 2.7.1
Fokko Driesprong created FLINK-11262: Summary: Bump jython-standalone to 2.7.1 Key: FLINK-11262 URL: https://issues.apache.org/jira/browse/FLINK-11262 Project: Flink Issue Type: Improvement Components: Streaming Reporter: Fokko Driesprong Assignee: Fokko Driesprong Due to a security issue: https://ossindex.sonatype.org/vuln/7a4be7b3-74f5-4a9b-a24f-d1fd80ed6bbca
[jira] [Updated] (FLINK-11260) Bump Janino compiler dependency
[ https://issues.apache.org/jira/browse/FLINK-11260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated FLINK-11260: - Description: Bump the Janino dependency: http://janino-compiler.github.io/janino/changelog.html (was: Bump the Janino depdency: http://janino-compiler.github.io/janino/changelog.html) > Bump Janino compiler dependency > --- > > Key: FLINK-11260 > URL: https://issues.apache.org/jira/browse/FLINK-11260 > Project: Flink > Issue Type: Improvement > Components: Core >Affects Versions: 1.7.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Fix For: 1.7.2 > > Time Spent: 10m > Remaining Estimate: 0h > > Bump the Janino dependency: > http://janino-compiler.github.io/janino/changelog.html
[jira] [Updated] (FLINK-11260) Bump Janino compiler dependency
[ https://issues.apache.org/jira/browse/FLINK-11260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated FLINK-11260: - Description: Bump the Janino depdency: http://janino-compiler.github.io/janino/changelog.html (was: Bump the commons-compiler) > Bump Janino compiler dependency > --- > > Key: FLINK-11260 > URL: https://issues.apache.org/jira/browse/FLINK-11260 > Project: Flink > Issue Type: Improvement > Components: Core >Affects Versions: 1.7.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Fix For: 1.7.2 > > Time Spent: 10m > Remaining Estimate: 0h > > Bump the Janino depdency: > http://janino-compiler.github.io/janino/changelog.html
[jira] [Created] (FLINK-11260) Bump Janino compiler dependency
Fokko Driesprong created FLINK-11260: Summary: Bump Janino compiler dependency Key: FLINK-11260 URL: https://issues.apache.org/jira/browse/FLINK-11260 Project: Flink Issue Type: Improvement Components: Core Affects Versions: 1.7.1 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.7.2 Bump the commons-compiler
[jira] [Created] (FLINK-11259) Bump Zookeeper dependency to 3.4.13
Fokko Driesprong created FLINK-11259: Summary: Bump Zookeeper dependency to 3.4.13 Key: FLINK-11259 URL: https://issues.apache.org/jira/browse/FLINK-11259 Project: Flink Issue Type: Improvement Components: Cluster Management Affects Versions: 1.7.1 Reporter: Fokko Driesprong Assignee: Fokko Driesprong Fix For: 1.7.2 Bump Zookeeper to 3.4.13 https://zookeeper.apache.org/doc/r3.4.13/releasenotes.html
[jira] [Created] (FLINK-11258) Add badge to the README
Fokko Driesprong created FLINK-11258: Summary: Add badge to the README Key: FLINK-11258 URL: https://issues.apache.org/jira/browse/FLINK-11258 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Fokko Driesprong Assignee: Fokko Driesprong I think we should add the build badge to the README so we can check whether master is still passing.
[GitHub] flink issue #4072: [FLINK-6848] Update managed state docs
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/4072 Hi @tzulitai, I've added some more Scala examples. If you are still missing something, please let me know. Kind regards, Fokko --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #4072: [FLINK-6848] Update managed state docs
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/4072 Hi @tzulitai, I fully agree. Give me some time to work on the other Scala examples, I need to make sure that they are working properly. I'll wrap it up this week. I'll update the commit and rebase with master. Cheers, Fokko
[GitHub] flink pull request #4072: [FLINK-6848] Update managed state docs
GitHub user Fokko opened a pull request: https://github.com/apache/flink/pull/4072 [FLINK-6848] Update managed state docs Hi guys, I would like to add an example of how to work with managed state in Scala. The code is tested locally and might be a nice addition to the docs. Cheers, Fokko Driesprong Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration. If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the [How To Contribute guide](http://flink.apache.org/how-to-contribute.html). In addition to going through the list, please provide a meaningful description of your changes. - [x] General - The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text") - The pull request addresses only one issue - Each commit in the PR has a meaningful commit message (including the JIRA id) - [x] Documentation - Documentation has been added for new functionality - Old documentation affected by the pull request has been updated - JavaDoc for public methods has been added - [x] Tests & Build - Functionality added by the pull request is covered by tests - `mvn clean verify` has been executed successfully locally or a Travis build has passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/Fokko/flink fd-update-raw-and-managed-state-docs Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/4072.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4072 commit 8103fc28d10d131eb1273dba4b477c25ac278bf0 Author: Fokko Driesprong Date: 2017-06-04T14:08:44Z Update managed state docs Add an example of how to work with managed state in Scala
[jira] [Created] (FLINK-6848) Extend the managed state docs with a Scala example
Fokko Driesprong created FLINK-6848: --- Summary: Extend the managed state docs with a Scala example Key: FLINK-6848 URL: https://issues.apache.org/jira/browse/FLINK-6848 Project: Flink Issue Type: Bug Reporter: Fokko Driesprong Hi all, It would be nice to add a Scala example code snippet in the Managed state docs. This makes it a bit easier to start using managed state in Scala. The code is tested and works. Kind regards, Fokko
[GitHub] flink pull request #3280: [Flink-5724] Error in documentation zipping elemen...
GitHub user Fokko opened a pull request: https://github.com/apache/flink/pull/3280 [Flink-5724] Error in documentation zipping elements Because the Scala tab is defined twice, it is not possible to open the Python tab. Please look at the documentation page itself: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/batch/zip_elements_guide.html I believe it is a copy-paste error (which also happens to me too often ;) Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration. If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the [How To Contribute guide](http://flink.apache.org/how-to-contribute.html). In addition to going through the list, please provide a meaningful description of your changes. - [x] General - The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text") - The pull request addresses only one issue - Each commit in the PR has a meaningful commit message (including the JIRA id) - [x] Documentation - Documentation has been added for new functionality - Old documentation affected by the pull request has been updated - JavaDoc for public methods has been added - [x] Tests & Build - Functionality added by the pull request is covered by tests - `mvn clean verify` has been executed successfully locally or a Travis build has passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/Fokko/flink fd-fix-docs-zipping-elements Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3280.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3280 commit f490fbcd3633f77e87d89ba397c2b1ed1a030543 Author: Fokko Driesprong Date: 2017-02-06T20:08:31Z Error in documentation zipping elements 
Because the Scala tab is defined twice, it is not possible to open the Python tab.
[jira] [Created] (FLINK-5724) Error in the 'Zipping Elements' docs
Fokko Driesprong created FLINK-5724: --- Summary: Error in the 'Zipping Elements' docs Key: FLINK-5724 URL: https://issues.apache.org/jira/browse/FLINK-5724 Project: Flink Issue Type: Bug Reporter: Fokko Driesprong The tab for the Python documentation isn't working because there are two tabs pointing at the Scala example.
[GitHub] flink pull request #:
Github user Fokko commented on the pull request: https://github.com/apache/flink/commit/32e1675aa38eec4a15272d62977dfe3ddbe92401#commitcomment-20632930 Did not know this, thanks.
[GitHub] flink issue #3077: [FLINK-5423] Implement Stochastic Outlier Selection
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3077 @tillrohrmann I've added documentation about the algorithm. Can you check? Cheers, Fokko
[GitHub] flink pull request #3077: [FLINK-5423] Implement Stochastic Outlier Selectio...
Github user Fokko commented on a diff in the pull request: https://github.com/apache/flink/pull/3077#discussion_r96290365 --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/outlier/StochasticOutlierSelection.scala --- @@ -0,0 +1,367 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.outlier + +/** An implementation of the Stochastic Outlier Selection algorithm by Jeroen Jansen + * + * For more information about SOS, see https://github.com/jeroenjanssens/sos + * J.H.M. Janssens, F. Huszar, E.O. Postma, and H.J. van den Herik. Stochastic + * Outlier Selection. Technical Report TiCC TR 2012-001, Tilburg University, + * Tilburg, the Netherlands, 2012. + * + * @example + * {{{ + * val inputDS = env.fromCollection(List( + * LabeledVector(0.0, DenseVector(1.0, 1.0)), + * LabeledVector(1.0, DenseVector(2.0, 1.0)), + * LabeledVector(2.0, DenseVector(1.0, 2.0)), + * LabeledVector(3.0, DenseVector(2.0, 2.0)), + * LabeledVector(4.0, DenseVector(5.0, 8.0)) // The outlier! 
+ * )) + * + * val sos = StochasticOutlierSelection() + * .setPerplexity(3) + * + * val outputDS = sos.transform(inputDS) + * + * val expectedOutputDS = Array( + *0.2790094479202896, + *0.25775014551682535, + *0.22136130977995766, + *0.12707053787018444, + *0.9922779902453757 // The outlier! + * ) + * + * assert(outputDS == expectedOutputDS) + * }}} + * + * =Parameters= + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.Perplexity]]: + * Perplexity can be interpreted as the k in k-nearest neighbor algorithms. The difference is that + * in SOS being a neighbor is not a binary property, but a probabilistic one. Should be between + * 1 and n-1, where n is the number of observations. + * (Default value: '''30''') + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.ErrorTolerance]]: + * The accepted error tolerance. When increasing this number, it will sacrifice accuracy in + * return for reduced computational time. + * (Default value: '''1e-20''') + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.MaxIterations]]: + * The maximum number of iterations to perform. 
(Default value: '''5000''') + */ + +import breeze.linalg.functions.euclideanDistance +import breeze.linalg.{sum, DenseVector => BreezeDenseVector, Vector => BreezeVector} +import org.apache.flink.api.common.operators.Order +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.api.scala.utils._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap, WithParameters} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{TransformDataSetOperation, Transformer} + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +class StochasticOutlierSelection extends Transformer[StochasticOutlierSelection] { + + import StochasticOutlierSelection._ + + + /** Sets the perplexity of the outlier selection algorithm, can be seen as the k of kNN +* For more information, please read the Stochastic Outlier Selection algorithm paper +* +* @param perplexity the perplexity of the affinity fit +* @return +*/ + def setPerplexity(perplexity: Double): StochasticOutlierSelection = { +requir
[GitHub] flink pull request #3077: [FLINK-5423] Implement Stochastic Outlier Selectio...
Github user Fokko commented on a diff in the pull request: https://github.com/apache/flink/pull/3077#discussion_r96290218 --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/outlier/StochasticOutlierSelection.scala --- @@ -0,0 +1,367 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.outlier + +/** An implementation of the Stochastic Outlier Selection algorithm by Jeroen Jansen + * + * For more information about SOS, see https://github.com/jeroenjanssens/sos + * J.H.M. Janssens, F. Huszar, E.O. Postma, and H.J. van den Herik. Stochastic + * Outlier Selection. Technical Report TiCC TR 2012-001, Tilburg University, + * Tilburg, the Netherlands, 2012. + * + * @example + * {{{ + * val inputDS = env.fromCollection(List( + * LabeledVector(0.0, DenseVector(1.0, 1.0)), + * LabeledVector(1.0, DenseVector(2.0, 1.0)), + * LabeledVector(2.0, DenseVector(1.0, 2.0)), + * LabeledVector(3.0, DenseVector(2.0, 2.0)), + * LabeledVector(4.0, DenseVector(5.0, 8.0)) // The outlier! 
+ * )) + * + * val sos = StochasticOutlierSelection() + * .setPerplexity(3) + * + * val outputDS = sos.transform(inputDS) + * + * val expectedOutputDS = Array( + *0.2790094479202896, + *0.25775014551682535, + *0.22136130977995766, + *0.12707053787018444, + *0.9922779902453757 // The outlier! + * ) + * + * assert(outputDS == expectedOutputDS) + * }}} + * + * =Parameters= + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.Perplexity]]: + * Perplexity can be interpreted as the k in k-nearest neighbor algorithms. The difference is that + * in SOS being a neighbor is not a binary property, but a probabilistic one. Should be between + * 1 and n-1, where n is the number of observations. + * (Default value: '''30''') + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.ErrorTolerance]]: + * The accepted error tolerance. When increasing this number, it will sacrifice accuracy in + * return for reduced computational time. + * (Default value: '''1e-20''') + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.MaxIterations]]: + * The maximum number of iterations to perform. 
(Default value: '''5000''') + */ + +import breeze.linalg.functions.euclideanDistance +import breeze.linalg.{sum, DenseVector => BreezeDenseVector, Vector => BreezeVector} +import org.apache.flink.api.common.operators.Order +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.api.scala.utils._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap, WithParameters} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{TransformDataSetOperation, Transformer} + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +class StochasticOutlierSelection extends Transformer[StochasticOutlierSelection] { + + import StochasticOutlierSelection._ + + + /** Sets the perplexity of the outlier selection algorithm, can be seen as the k of kNN +* For more information, please read the Stochastic Outlier Selection algorithm paper +* +* @param perplexity the perplexity of the affinity fit +* @return +*/ + def setPerplexity(perplexity: Double): StochasticOutlierSelection = { +requir
[GitHub] flink pull request #3077: [FLINK-5423] Implement Stochastic Outlier Selectio...
Github user Fokko commented on a diff in the pull request: https://github.com/apache/flink/pull/3077#discussion_r96290195 --- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/outlier/StochasticOutlierSelection.scala --- @@ -0,0 +1,367 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.outlier + +/** An implementation of the Stochastic Outlier Selection algorithm by Jeroen Jansen + * + * For more information about SOS, see https://github.com/jeroenjanssens/sos + * J.H.M. Janssens, F. Huszar, E.O. Postma, and H.J. van den Herik. Stochastic + * Outlier Selection. Technical Report TiCC TR 2012-001, Tilburg University, + * Tilburg, the Netherlands, 2012. + * + * @example + * {{{ + * val inputDS = env.fromCollection(List( + * LabeledVector(0.0, DenseVector(1.0, 1.0)), + * LabeledVector(1.0, DenseVector(2.0, 1.0)), + * LabeledVector(2.0, DenseVector(1.0, 2.0)), + * LabeledVector(3.0, DenseVector(2.0, 2.0)), + * LabeledVector(4.0, DenseVector(5.0, 8.0)) // The outlier! 
+ * )) + * + * val sos = StochasticOutlierSelection() + * .setPerplexity(3) + * + * val outputDS = sos.transform(inputDS) + * + * val expectedOutputDS = Array( + *0.2790094479202896, + *0.25775014551682535, + *0.22136130977995766, + *0.12707053787018444, + *0.9922779902453757 // The outlier! + * ) + * + * assert(outputDS == expectedOutputDS) + * }}} + * + * =Parameters= + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.Perplexity]]: + * Perplexity can be interpreted as the k in k-nearest neighbor algorithms. The difference is that + * in SOS being a neighbor is not a binary property, but a probabilistic one. Should be between + * 1 and n-1, where n is the number of observations. + * (Default value: '''30''') + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.ErrorTolerance]]: + * The accepted error tolerance. When increasing this number, it will sacrifice accuracy in + * return for reduced computational time. + * (Default value: '''1e-20''') + * + * - [[org.apache.flink.ml.outlier.StochasticOutlierSelection.MaxIterations]]: + * The maximum number of iterations to perform. 
(Default value: '''5000''') + */ + +import breeze.linalg.functions.euclideanDistance +import breeze.linalg.{sum, DenseVector => BreezeDenseVector, Vector => BreezeVector} +import org.apache.flink.api.common.operators.Order +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.api.scala.utils._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap, WithParameters} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{TransformDataSetOperation, Transformer} + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +class StochasticOutlierSelection extends Transformer[StochasticOutlierSelection] { + + import StochasticOutlierSelection._ + + + /** Sets the perplexity of the outlier selection algorithm, can be seen as the k of kNN +* For more information, please read the Stochastic Outlier Selection algorithm paper +* +* @param perplexity the perplexity of the affinity fit +* @return +*/ + def setPerplexity(perplexity: Double): StochasticOutlierSelection = { +requir
[GitHub] flink pull request #3077: [FLINK-5423] Implement Stochastic Outlier Selectio...
Github user Fokko commented on a diff in the pull request: https://github.com/apache/flink/pull/3077#discussion_r96289143 --- Diff: flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/outlier/StochasticOutlierSelectionITSuite.scala --- @@ -0,0 +1,240 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.outlier + +import breeze.linalg.{sum, DenseVector => BreezeDenseVector} +import org.apache.flink.api.scala._ +import org.apache.flink.ml.common.LabeledVector +import org.apache.flink.ml.math.DenseVector +import org.apache.flink.ml.outlier.StochasticOutlierSelection.BreezeLabeledVector +import org.apache.flink.ml.util.FlinkTestBase +import org.scalatest.{FlatSpec, Matchers} + +class StochasticOutlierSelectionITSuite extends FlatSpec with Matchers with FlinkTestBase { + behavior of "Stochastic Outlier Selection algorithm" + val EPSILON = 1e-16 + + /* +Unit-tests created based on the Python scripts of the algorithms author' +https://github.com/jeroenjanssens/scikit-sos + +For more information about SOS, see https://github.com/jeroenjanssens/sos +J.H.M. Janssens, F. Huszar, E.O. Postma, and H.J. van den Herik. Stochastic +Outlier Selection. 
Technical Report TiCC TR 2012-001, Tilburg University, +Tilburg, the Netherlands, 2012. + */ + + val perplexity = 3 + val errorTolerance = 0 + val maxIterations = 5000 + val parameters = new StochasticOutlierSelection().setPerplexity(perplexity).parameters + + val env = ExecutionEnvironment.getExecutionEnvironment + + it should "Compute the perplexity of the vector and return the correct error" in { +val vector = BreezeDenseVector(Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0)) + +val output = Array( + 0.39682901665799636, + 0.15747326846175236, + 0.06248996227359784, + 0.024797830280027126, + 0.009840498605275054, + 0.0039049953849556816, + 6.149323865970302E-4, + 2.4402301428445443E-4, + 9.683541280042027E-5 +) + +val search = StochasticOutlierSelection.binarySearch( + vector, + Math.log(perplexity), + maxIterations, + errorTolerance +).toArray + +search should be(output) + } + + it should "Compute the distance matrix and give symmetrical distances" in { + +val data = env.fromCollection(List( + BreezeLabeledVector(0, BreezeDenseVector(Array(1.0, 3.0))), + BreezeLabeledVector(1, BreezeDenseVector(Array(5.0, 1.0))) +)) + +val distanceMatrix = StochasticOutlierSelection + .computeDissimilarityVectors(data) + .map(_.data) + .collect() + .toArray + +print(distanceMatrix) --- End diff -- Oops, still in there from the debugging.
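The `binarySearch` exercised by the test above finds the beta for which the entropy of the affinity distribution over a dissimilarity vector matches log(perplexity), as described in the SOS paper. The following is a minimal, hypothetical Java sketch of that search; the names are illustrative and this is not Flink's actual `StochasticOutlierSelection.binarySearch`:

```java
public class PerplexitySketch {
    // Entropy of the affinity distribution p_j = exp(-beta*d_j) / sum_k exp(-beta*d_k),
    // computed in closed form: H = log(sumA) + beta * sum(d_j*a_j) / sumA.
    static double entropy(double[] d, double beta) {
        double sumA = 0.0, sumDA = 0.0;
        for (double dj : d) {
            double a = Math.exp(-beta * dj);
            sumA += a;
            sumDA += dj * a;
        }
        return Math.log(sumA) + beta * sumDA / sumA;
    }

    // Binary search for beta such that entropy(d, beta) == log(perplexity).
    // Entropy decreases monotonically in beta, so bisection converges.
    static double searchBeta(double[] d, double perplexity, int maxIterations, double tolerance) {
        double target = Math.log(perplexity);
        double beta = 1.0, lo = 0.0, hi = Double.POSITIVE_INFINITY;
        for (int i = 0; i < maxIterations; i++) {
            double h = entropy(d, beta);
            if (Math.abs(h - target) <= tolerance) break;
            if (h > target) {       // distribution too flat: sharpen it
                lo = beta;
                beta = Double.isInfinite(hi) ? beta * 2 : (lo + hi) / 2;
            } else {                // distribution too peaked: flatten it
                hi = beta;
                beta = (lo + hi) / 2;
            }
        }
        return beta;
    }

    public static void main(String[] args) {
        // Same dissimilarity vector and perplexity as the unit test above.
        double[] d = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0};
        double beta = searchBeta(d, 3.0, 5000, 1e-10);
        System.out.println("entropy matches log(3): "
            + (Math.abs(entropy(d, beta) - Math.log(3.0)) < 1e-6));
    }
}
```

The perplexity then plays the role of a soft k: larger perplexity spreads affinity over more neighbors, smaller perplexity concentrates it on the nearest ones.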
[GitHub] flink issue #3077: [FLINK-5423] Implement Stochastic Outlier Selection
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3077 Thanks @tillrohrmann, excellent idea regarding the documentation. I'll also process the code comments, good feedback. Somewhere today or tomorrow I will fix this. Cheers, Fokko
[GitHub] flink issue #3081: [FLINK-5426] Clean up the Flink Machine Learning library
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3081 I've rebased both branches with master.
[GitHub] flink issue #3081: [FLINK-5426] Clean up the Flink Machine Learning library
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3081 Hi @tillrohrmann, did you find any time to check #3077 and this PR?
[GitHub] flink issue #3081: [FLINK-5426] Clean up the Flink Machine Learning library
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3081 I think we have some flaky tests, since the build passes on my own Travis: https://travis-ci.org/Fokko/flink/builds/189854940
[GitHub] flink issue #3077: [FLINK-5423] Implement Stochastic Outlier Selection
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3077 I think we have some flaky tests, since the build passes on my own Travis: https://travis-ci.org/Fokko/flink/builds/189855914
[GitHub] flink issue #3081: [FLINK-5426] Clean up the Flink Machine Learning library
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3081 I think we have some flaky tests, since the build passes on my own Travis: https://travis-ci.org/Fokko/flink/builds/189855914
[GitHub] flink pull request #3081: [FLINK-5426] Clean up the Flink Machine Learning l...
GitHub user Fokko opened a pull request: https://github.com/apache/flink/pull/3081 [FLINK-5426] Clean up the Flink Machine Learning library Hi guys, I would like to contribute to the Flink ML library. I took the liberty of cleaning up some of the code and improving the scaladoc. Besides that, I've implemented #3077 to get more familiar with the Flink API, and I would love to contribute more in the future, in particular to the machine learning library. If you have any questions, please let me know. Let me know if improvements to the ML library are appreciated in general. - [x] General - The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text") - The pull request addresses only one issue - Each commit in the PR has a meaningful commit message (including the JIRA id) - [x] Documentation - Documentation has been added for new functionality - Old documentation affected by the pull request has been updated - JavaDoc for public methods has been added - [x] Tests & Build - Functionality added by the pull request is covered by tests - `mvn clean verify` has been executed successfully locally or a Travis build has passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/Fokko/flink fd-cleanup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3081.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3081 commit 013b22d7bcaf48c8e96983295fcc455faf0aa94b Author: Fokko Driesprong Date: 2017-01-06T20:34:53Z Removed duplicate tests, improved scaladoc and naming, removed typos in scaladoc, introduced and improved use of constants, improved test-case naming.
[jira] [Created] (FLINK-5426) Clean up the Flink Machine Learning library
Fokko Driesprong created FLINK-5426: --- Summary: Clean up the Flink Machine Learning library Key: FLINK-5426 URL: https://issues.apache.org/jira/browse/FLINK-5426 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Fokko Driesprong Hi guys, I would like to clean up the Machine Learning library. A lot of the code in the ML library does not conform to the original contribution guide. For example: Duplicate tests with different names but exactly the same test case: https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/math/DenseVectorSuite.scala#L148 https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/math/DenseVectorSuite.scala#L164 Lots of multi-line test cases: https://github.com/Fokko/flink/blob/master/flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/math/DenseVectorSuite.scala Misuse of constants: https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/math/DenseMatrix.scala#L58 Please allow me to clean this up; I'm looking forward to contributing more code, especially to the ML part. I have been a contributor to Apache Spark and am happy to extend the codebase with new distributed algorithms and make it more mature. Cheers, Fokko -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request #3077: [FLINK-5423] Implement Stochastic Outlier Selectio...
GitHub user Fokko opened a pull request: https://github.com/apache/flink/pull/3077 [FLINK-5423] Implement Stochastic Outlier Selection Implemented the Stochastic Outlier Selection algorithm in the Machine Learning library, including the test code. Added documentation. Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration. If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the [How To Contribute guide](http://flink.apache.org/how-to-contribute.html). In addition to going through the list, please provide a meaningful description of your changes. - [ ] General - The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text") - The pull request addresses only one issue - Each commit in the PR has a meaningful commit message (including the JIRA id) - [ ] Documentation - Documentation has been added for new functionality - Old documentation affected by the pull request has been updated - JavaDoc for public methods has been added - [ ] Tests & Build - Functionality added by the pull request is covered by tests - `mvn clean verify` has been executed successfully locally or a Travis build has passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/Fokko/flink fd-implement-stochastic-outlier-selection Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3077.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3077 commit a67b170f6eee6c053322a4730f1b8dcaa680a112 Author: Fokko Driesprong Date: 2016-12-30T06:38:52Z Implemented the Stochastic Outlier Selection algorithm in the Machine Learning library, including the test code. Added documentation.
[jira] [Created] (FLINK-5423) Implement Stochastic Outlier Selection
Fokko Driesprong created FLINK-5423: --- Summary: Implement Stochastic Outlier Selection Key: FLINK-5423 URL: https://issues.apache.org/jira/browse/FLINK-5423 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Fokko Driesprong I've implemented the Stochastic Outlier Selection (SOS) algorithm by Jeroen Janssens. http://jeroenjanssens.com/2013/11/24/stochastic-outlier-selection.html It is integrated as much as possible with the components from the machine learning library. The algorithm has been compared to four other outlier-detection algorithms, and the comparison shows that SOS achieves higher performance on most of the tested real-world datasets.
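For readers unfamiliar with SOS, here is a hedged sketch of the idea, not the Flink implementation: pairwise dissimilarities are turned into affinities, affinities are normalized into binding probabilities, and a point's outlier probability is the chance that no other point "binds" to it. The real algorithm tunes a per-point variance via a perplexity binary search; this sketch assumes one fixed variance for brevity, and all names are hypothetical.

```python
import math

def sos_outlier_probabilities(points, variance=1.0):
    """Simplified SOS: fixed variance instead of a perplexity search."""
    n = len(points)
    # Pairwise squared Euclidean dissimilarities.
    d = [[sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
          for j in range(n)] for i in range(n)]
    # Affinities decay exponentially with dissimilarity;
    # a point has no affinity with itself.
    a = [[0.0 if i == j else math.exp(-d[i][j] / (2.0 * variance))
          for j in range(n)] for i in range(n)]
    # Binding probabilities: each row of affinities normalized to sum to 1.
    b = [[a[i][j] / sum(a[i]) for j in range(n)] for i in range(n)]
    # Outlier probability of j: probability that no other point binds to j.
    return [math.prod(1.0 - b[i][j] for i in range(n) if i != j)
            for j in range(n)]

# Three clustered points and one isolated point: the isolated point
# receives the highest outlier probability.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
probs = sos_outlier_probabilities(points)
```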
[GitHub] flink pull request #3048: Clarified the import path of the Breeze DenseVecto...
Github user Fokko closed the pull request at: https://github.com/apache/flink/pull/3048
[GitHub] flink issue #2442: [FLINK-4148] incorrect calculation minDist distance in Qu...
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/2442 Looks good, please merge. Should have been fixed long ago :-)
[GitHub] flink issue #3052: Swap the pattern matching order
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3052 Alright, I've just rebased with master. Looks like Travis is working again, good job! Cheers!
[GitHub] flink pull request #3052: Swap the pattern matching order
Github user Fokko closed the pull request at: https://github.com/apache/flink/pull/3052
[GitHub] flink pull request #3052: Swap the pattern matching order
GitHub user Fokko opened a pull request: https://github.com/apache/flink/pull/3052 Swap the pattern matching order Swap the pattern matching order, because `EuclideanDistanceMetric extends SquaredEuclideanDistanceMetric extends DistanceMetric`; otherwise the `EuclideanDistanceMetric` branch can never match:

```
[WARNING] /Users/fokkodriesprong/Desktop/flink-fokko/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala:106: warning: unreachable code
[WARNING] case _: EuclideanDistanceMetric => math.sqrt(minDist)
[WARNING] ^
[WARNING] warning: Class org.apache.log4j.Level not found - continuing with a stub.
[WARNING] warning: there were 1 feature warning(s); re-run with -feature for details
[WARNING] three warnings found
```

You can merge this pull request into a Git repository by running: $ git pull https://github.com/Fokko/flink fd-fix-pattern-matching Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3052.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3052 commit 29ebffc77cfbe917796f44764936972b578ebd38 Author: Fokko Driesprong Date: 2016-12-29T22:49:14Z Swap the pattern matching order, because EuclideanDistanceMetric extends SquaredEuclideanDistanceMetric extends DistanceMetric, otherwise the EuclideanDistance cannot be executed.
[GitHub] flink issue #3048: Clarified the import path of the Breeze DenseVector
Github user Fokko commented on the issue: https://github.com/apache/flink/pull/3048 Locally the tests pass. Looking at the error logs, the failure doesn't appear to be related to the changes in this PR, for example:

```
java.io.FileNotFoundException: build-target/lib/flink-dist-*.jar (No such file or directory)
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.(ZipFile.java:220)
at java.util.zip.ZipFile.(ZipFile.java:150)
at java.util.zip.ZipFile.(ZipFile.java:121)
at sun.tools.jar.Main.list(Main.java:1060)
at sun.tools.jar.Main.run(Main.java:291)
at sun.tools.jar.Main.main(Main.java:1233)
find: `./flink-yarn-tests/target/flink-yarn-tests*': No such file or directory
```
[GitHub] flink pull request #3048: Clarified the import path of the Breeze DenseVecto...
GitHub user Fokko opened a pull request: https://github.com/apache/flink/pull/3048 Clarified the import path of the Breeze DenseVector Guys, I'm working on an extension of the ML library in Flink, and I stumbled upon this. Since it is such a trivial fix, I didn't create a JIRA ticket. Keep up the good work! Cheers, You can merge this pull request into a Git repository by running: $ git pull https://github.com/Fokko/flink fd-cleanup-package-structure Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3048.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3048 commit 3fd38fe9785d607a05d045cd54a05af9ed48e350 Author: Fokko Driesprong Date: 2016-12-27T14:43:15Z Replaced the full import path with the BreezeDenseVector itself to make it more readable