[jira] [Created] (FLINK-9167) Approximate KNN with Incremental Insertion

2018-04-12 Thread Alex Klibisz (JIRA)
Alex Klibisz created FLINK-9167:
---

 Summary: Approximate KNN with Incremental Insertion
 Key: FLINK-9167
 URL: https://issues.apache.org/jira/browse/FLINK-9167
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Alex Klibisz


I'm new to Flink, and I'm curious about an extension of approximate KNN to
support incremental insertion into the index.

Consider the case where you build an index from a training set of vectors. As
your application runs, you ingest a stream of new vectors (e.g. users posting
new content). For every new vector, you compute its neighbors against the
existing index. Then you immediately insert the new vector into the index so
that it can be returned for subsequent queries.
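
As far as I can tell nothing like this exists in FlinkML yet, but the loop itself is
easy to express. Below is a minimal sketch as a ProcessFunction, using a naive
brute-force list as a stand-in for a real approximate index (LSH, HNSW, etc.). All
names are illustrative, the index is not checkpointed, and with parallelism > 1 each
subtask would hold only a partial index:
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class QueryThenInsertFunction extends ProcessFunction<float[], List<float[]>> {

  private final int k;
  private transient List<float[]> index; // stand-in for an approximate index

  public QueryThenInsertFunction(int k) { this.k = k; }

  @Override
  public void open(Configuration parameters) {
    index = new ArrayList<>(); // could be seeded from the training set here
  }

  @Override
  public void processElement(float[] vector, Context ctx, Collector<List<float[]>> out) {
    // 1) query the new vector's neighbors against the existing index
    List<float[]> candidates = new ArrayList<>(index);
    candidates.sort(Comparator.comparingDouble(v -> squaredDistance(v, vector)));
    out.collect(new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size()))));
    // 2) immediately insert the vector so subsequent queries can return it
    index.add(vector);
  }

  private static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
{code}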

Perhaps this is possible with the current components of Flink, or maybe another
streaming tool already has a comparable implementation? If so, I would
appreciate any pointers or links to examples.

If it's not available, is there interest in implementing such a feature? If so, 
I would be interested in making an attempt.

I appreciate any tips or insight. Thanks!





[jira] [Created] (FLINK-9166) Performance issue with Flink SQL

2018-04-12 Thread SUBRAMANYA SURESH (JIRA)
SUBRAMANYA SURESH created FLINK-9166:


 Summary: Performance issue with Flink SQL
 Key: FLINK-9166
 URL: https://issues.apache.org/jira/browse/FLINK-9166
 Project: Flink
  Issue Type: Bug
  Components: Table API & SQL
Affects Versions: 1.4.2
Reporter: SUBRAMANYA SURESH


With a high number of Flink SQL queries (100 copies of the query below), the
Flink command-line client fails with a "JobManager did not respond within
60000 ms" error on a YARN cluster.
The JobManager logs show nothing after the last TaskManager started, except
DEBUG logs with "job with ID 5cd95f89ed7a66ec44f2d19eca0592f7 not found in
JobManager", indicating it is likely stuck (creating the ExecutionGraph?).

The same works as a standalone Java program locally (with high CPU initially).

Note: Each Row in structStream contains 515 columns (many end up null),
including a column that holds the raw message.

In the YARN cluster we specify 18 GB for the TaskManager, 18 GB for the
JobManager, 5 slots each, and a parallelism of 725 (the number of partitions in
our Kafka source).

*Query:*
{code:java}
select count(*), 'idnumber' as criteria, Environment, CollectedTimestamp,
       EventTimestamp, RawMsg, Source
from structStream
where Environment='MyEnvironment' and Rule='MyRule' and LogType='MyLogType'
  and Outcome='Success'
group by tumble(proctime, INTERVAL '1' SECOND), Environment,
         CollectedTimestamp, EventTimestamp, RawMsg, Source
{code}
*Code:*
{code:java}
public static void main(String[] args) throws Exception {
  FileSystems.newFileSystem(
      KafkaReadingStreamingJob.class.getResource(WHITELIST_CSV).toURI(), new HashMap<>());

  final StreamExecutionEnvironment streamingEnvironment = getStreamExecutionEnvironment();
  final StreamTableEnvironment tableEnv =
      TableEnvironment.getTableEnvironment(streamingEnvironment);

  final DataStream<Row> structStream = getKafkaStreamOfRows(streamingEnvironment);
  tableEnv.registerDataStream("structStream", structStream);
  tableEnv.scan("structStream").printSchema();

  for (int i = 0; i < 100; i++) {
    for (String query : Queries.sample) {
      // Queries.sample has one query, shown above.
      Table selectQuery = tableEnv.sqlQuery(query);

      DataStream<Row> selectQueryStream = tableEnv.toAppendStream(selectQuery, Row.class);
      selectQueryStream.print();
    }
  }

  // execute program
  streamingEnvironment.execute("Kafka Streaming SQL");
}

private static DataStream<Row> getKafkaStreamOfRows(StreamExecutionEnvironment environment)
    throws Exception {
  Properties properties = getKafkaProperties();
  // TestDeserializer deserializes the JSON to a Row of string columns (515)
  // and also adds a column for the raw message.
  FlinkKafkaConsumer011<Row> consumer = new FlinkKafkaConsumer011<>(
      KAFKA_TOPIC_TO_CONSUME, new TestDeserializer(getRowTypeInfo()), properties);
  return environment.addSource(consumer);
}

private static RowTypeInfo getRowTypeInfo() throws Exception {
  // This has 515 fields.
  List<String> fieldNames = DDIManager.getDDIFieldNames();
  fieldNames.add("rawkafka"); // raw message added by TestDeserializer
  fieldNames.add("proctime");

  // Fill typeInformationArray with StringType for all but the last field,
  // which is of type Time.
  // ...
  return new RowTypeInfo(typeInformationArray, fieldNamesArray);
}

private static StreamExecutionEnvironment getStreamExecutionEnvironment() throws IOException {
  final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
  env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);

  env.enableCheckpointing(6);
  env.setStateBackend(new FsStateBackend(CHECKPOINT_DIR));
  env.setParallelism(725);
  return env;
}
{code}
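
For what it's worth, one cheap way to see how large the submitted job actually
gets is to dump the generated plan right before calling execute();
getExecutionPlan() is a standard StreamExecutionEnvironment method:
{code:java}
// Prints the JSON plan of everything registered so far. With 100 copies of the
// query, the plan (and the ExecutionGraph built from it) grows accordingly.
System.out.println(streamingEnvironment.getExecutionPlan());
{code}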





Re: [jira] [Created] (FLINK-9164) times(#,#) quantifier does not seem to work

2018-04-12 Thread Kostas Kloudas
Hi Romain,

What if you remove the AfterMatchSkipStrategy.skipPastLastEvent()?

Kostas

> On Apr 12, 2018, at 1:19 PM, Romain Revol (JIRA)  wrote:
> 
> AfterMatchSkipStrategy.skipPastLastEvent()



Re: Using Slack for online discussions

2018-04-12 Thread Aljoscha Krettek
-1

I would also be in favour of not adding a Slack channel. I see that it can be
useful for people because it's a little more direct, but I think that having
Slack open and also receiving questions and contacts from people there adds
more burden on developers than we should take on. Plus, I think having a
searchable mail archive is really nice, and moving some discussion to Slack
breaks that.

> On 3. Apr 2018, at 08:17, Suneel Marthi  wrote:
> 
> *As one of the admins for ASF slack*, let me clarify a few things:
> 
> 1. ASF slack is open to everyone regardless of apache.org affiliation - you
> need to make a request on dev@ to be added to ASF slack if you don't have an
> apache.org address.
> 
> 2. There is an archive bot in place on ASF slack for archiving older
> messages and keeping under the free tier's 10k message limit.
> 
> 3. I am with Till Rohrmann, Chesnay and others here who have already
> expressed their -1 and am a strong -1 too regarding moving Flink convos to a
> slack channel - *this project has been very exemplary in being transparent
> and community-driven from Day 1 as an Apache podling* - I really don't see
> a reason for Flink to be moving to slack.
> 
> My -1 again to slack move.
> 
> 
> 
> On Tue, Apr 3, 2018 at 11:01 AM, Ted Yu  wrote:
> 
>> bq. A bot could archive the messages to a web site, or forward to the email
>> list
>> 
>> I think some formatting / condensing may be needed if communication on
>> Slack is forwarded to mailing list - since the sentences on Slack may not
>> be as polished as on the emails.
>> There is also the mapping between a person's identity on Slack versus on
>> email.
>> 
>> FYI
>> 
>> On Tue, Apr 3, 2018 at 7:56 AM, TechnoMage  wrote:
>> 
>>> We use Slack in many contexts, company, community, etc.  It has many
>>> advantages over email.  For one being a separate channel from general
>> email
>>> it stands out when there are new messages.  Notifications can be
>> configured
>>> separately for each channel, and can arrive on multiple mobile devices
>> with
>>> synchronization between them.  A bot could archive the messages to a web
>>> site, or forward to the email list.  It also allows upload of code
>> snippets
>>> with formatting and voice/screen sharing where appropriate.  I would love
>>> to see it as a supported platform.
>>> 
>>> Michael
>>> 
 On Apr 3, 2018, at 7:52 AM, Thomas Weise  wrote:
 
 The invite link is self-service. Everyone can sign up.
 
 As for the searchable record, it may be possible to archive what's
>> posted
 on the slack channel by subscribing the mailing list.
 
 I think a communication platform like Slack or IRC complements email,
>> the
 type of messages there would typically be different from email threads.
 
 Thanks,
 Thomas
 
 
 On Tue, Apr 3, 2018 at 7:37 AM, Ted Yu  wrote:
 
> It is the lack of a searchable public record (of Slack) that we should
> be concerned with.
> 
> Also, requiring an invitation would be a bottleneck for the growth of
> participants.
> 
> Cheers
> 
> On Tue, Apr 3, 2018 at 6:11 AM, Till Rohrmann 
> wrote:
> 
>> I'm a bit torn here. On the one hand I think Slack would be nice because it
>> allows a more direct interaction, similar to the IRC channel we once had.
>> 
>> On the other hand, I fear that some information/discussions might get lost
>> in the depths of Slack, at least after the 10k message limit has been
>> reached. Posting these things on the ML allows us to persist the information
>> in the ML archives. Moreover, it would discourage to some extent the usage
>> of the ML in general, which is not in the spirit of the ASF.
>> 
>> The problem that only ASF committers have access to the ASF slack channel
>> can be solved by an explicit invite for everyone [1].
>> 
>> [1] https://s.apache.org/slack-invite
>> 
>> Cheers,
>> Till
>> 
>> 
>> On Tue, Apr 3, 2018 at 3:09 PM, Ted Yu  wrote:
>> 
>>> Thanks for the background information.
>>> I withdraw my previous +1.
>>>  Original message  From: Chesnay Schepler <ches...@apache.org>
>>> Date: 4/3/18 4:50 AM (GMT-08:00) To: dev@flink.apache.org
>>> Subject: Re: Using Slack for online discussions
>>> -1
>>> 
>>> 1. According to INFRA-14292, the ASF Slack isn't run by the ASF. This
>>>    alone puts this service into rather questionable territory as it
>>>    /looks/ like an official ASF service. If anyone can provide
>>>    information to the contrary, please do so.
>>> 2. We already discuss things on the mailing lists, JIRA and GitHub. All
>>>    of these are available to the

[jira] [Created] (FLINK-9165) Improve CassandraSinkBase to send Collections

2018-04-12 Thread Christopher Hughes (JIRA)
Christopher Hughes created FLINK-9165:
-

 Summary: Improve CassandraSinkBase to send Collections
 Key: FLINK-9165
 URL: https://issues.apache.org/jira/browse/FLINK-9165
 Project: Flink
  Issue Type: Improvement
  Components: Cassandra Connector
Affects Versions: 1.4.2
Reporter: Christopher Hughes


The CassandraSinkBase can currently only handle individual objects. I propose
overloading the `send(IN value)` method with a `send(Collection<IN> value)`
variant.





[jira] [Created] (FLINK-9164) times(#,#) quantifier does not seem to work

2018-04-12 Thread Romain Revol (JIRA)
Romain Revol created FLINK-9164:
---

 Summary: times(#,#) quantifier does not seem to work
 Key: FLINK-9164
 URL: https://issues.apache.org/jira/browse/FLINK-9164
 Project: Flink
  Issue Type: Bug
  Components: CEP
Affects Versions: 1.4.2
 Environment: Windows 10 Pro 64-bit

Core i7-6820HQ @ 2.7 GHz

16GB RAM

Flink 1.4.2

Scala client

Scala 2.11.12
Reporter: Romain Revol


Assuming the following piece of code:
{code:java}
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.setParallelism(1)

val inputs = env.fromElements((1L, 'a'), (20L, 'a'), (22L, 'a'), (30L, 'a'), (40L, 'a'))
  .assignAscendingTimestamps(_._1)

val pattern = Pattern.begin[(Long, Char)]("Start", AfterMatchSkipStrategy.skipPastLastEvent())
  .where(_._2 == 'a').times(1, 2)

CEP.pattern(inputs, pattern).select(_("Start")).addSink(println(_))

env.execute("Test")
{code}
 

This results in
{code:java}
Buffer((1,a))
Buffer((20,a))
Buffer((22,a))
Buffer((30,a))
Buffer((40,a)){code}
While I would expect
{code:java}
Buffer((1,a), (20,a))
Buffer((22,a), (30,a))
Buffer((40,a)){code}
My purpose is to match events by pairs if possible, or alone if not. Note that
adding the greedy mode does nothing, but this may be due to
https://issues.apache.org/jira/browse/FLINK-8914.
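
For reference, Kostas's reply earlier in this digest suggests removing the skip
strategy; the same pattern without it, sketched in Java (the original is Scala):
{code:java}
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;

public class PatternWithoutSkip {
  public static Pattern<Tuple2<Long, Character>, ?> build() {
    // same quantifier, but no AfterMatchSkipStrategy argument to begin()
    return Pattern.<Tuple2<Long, Character>>begin("Start")
        .where(new SimpleCondition<Tuple2<Long, Character>>() {
          @Override
          public boolean filter(Tuple2<Long, Character> value) {
            return value.f1 == 'a';
          }
        })
        .times(1, 2);
  }
}
{code}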





Re: [DISCUSS] Releasing Flink 1.5.0

2018-04-12 Thread Christophe Jolif
Hi all,

A small patch: https://github.com/apache/flink/pull/5789 (JIRA:
https://issues.apache.org/jira/browse/FLINK-9103) was issued to help with
SSL certificates in Kubernetes deployments where you don't control your IPs.
As this is very small and helpful (at least to me and Edward, who issued the
fix), I was wondering if it could be considered for 1.5?

Thanks,
--
Christophe


On Mon, Mar 12, 2018 at 12:42 PM, Till Rohrmann 
wrote:

> Hi Pavel,
>
> currently, it is extremely difficult to say when it will happen since Flink
> 1.5 includes some very big changes which need thorough testing. Depending
> on that and what else the community finds on the way, it may go faster or
> slower. Personally, I hope to finish the release by the end of
> March/beginning of April.
>
> Cheers,
> Till
>
> On Thu, Mar 8, 2018 at 7:28 PM, Pavel Ciorba  wrote:
>
> > Approximately when is the release of Flink 1.5 planned?
> >
> > Best,
> >
> > 2018-03-01 11:20 GMT+02:00 Till Rohrmann :
> >
> > > Thanks for bringing this issue up Shashank. I think Aljoscha is taking
> a
> > > look at the issue. It looks like a serious bug which we should
> definitely
> > > fix. What I've heard so far is that it's not so trivial.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Thu, Mar 1, 2018 at 9:56 AM, shashank734 
> > wrote:
> > >
> > > > Can we have https://issues.apache.org/jira/browse/FLINK-7756 solved in
> > > > this version? We are unable to use checkpointing with CEP and the
> > > > RocksDB backend.
> > > >
> > > >
> > > >
> > > > --
> > > > Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> > > >
> > >
> >
>


Re: Multi-stream question

2018-04-12 Thread Fabian Hueske
Hi,

Ken's approach of having a joint data type and unioning the streams is
good. This will work seamlessly with checkpoints. Timo (in CC) used the
same approach to implement a prototype of a multi-way join.

A Tuple won't work though because the Tuple serializer does not support
null fields. You can use a Row or implement a custom, Either-like type.

Best, Fabian
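
For illustration, a custom Either-like type could look like the following
minimal sketch (class and field names are made up); keeping a public no-arg
constructor and public fields lets Flink serialize it as a POJO:
{code:java}
public class EitherEvent<A, B> {

  public A left;   // exactly one of the two fields is expected to be non-null
  public B right;

  public EitherEvent() {} // required for Flink's POJO serialization

  public static <A, B> EitherEvent<A, B> left(A value) {
    EitherEvent<A, B> e = new EitherEvent<>();
    e.left = value;
    return e;
  }

  public static <A, B> EitherEvent<A, B> right(B value) {
    EitherEvent<A, B> e = new EitherEvent<>();
    e.right = value;
    return e;
  }
}
{code}
Both inputs are then map()ped into this wrapper and union()ed into a single
stream, and downstream functions dispatch on which field is non-null.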


TechnoMage  wrote on Sat., Apr 7, 2018, 17:25:

> Thanks for the Tuple suggestion, I may use that.  I was asking about
> building a custom operator (just an idea).  I have since decided I can
> decompose the problem into pairs of streams and emit a stream to the next
> CoFlatMap to get the result I need.  Now to see if the idea works ...
>
> Michael
>
> > On Apr 7, 2018, at 1:10 PM, Ken Krugler 
> wrote:
> >
> > Hi Michael,
> >
> > There isn’t an operator that takes three (or more) streams, AFAIK.
> >
> > There is a CoFlatMapFunction that takes two different streams in, which
> could be used for some types of joins.
> >
> > Streaming joins are (typically) windowed (bounded), by
> time/count/something, so if you can maintain the required windowed state in
> a ProcessFunction then you can implement whatever custom logic is required
> for your join case.
> >
> > And for creating a unioned stream of multiple data types, one easy way
> is via (e.g.) Tuple3, where only one of the three
> fields is non-null for each tuple.
> >
> > -- Ken
> >
> > PS - I think the u...@flink.apache.org 
> list is probably a better forum for this question.
> >
> >> On Apr 7, 2018, at 10:47 AM, TechnoMage  wrote:
> >>
> >> In my case I have more elaborate logic to select data from the
> streams.  They are not all the same logical type, though I may be able to
> represent them as the same Java type.  My main question is whether it is
> technically feasible to have a single operator that takes multiple streams
> as input.  For example Operator(stream1, stream2, stream3) and produces an
> output stream.  Can the checkpointing and other logic accommodate this if I
> write sufficient custom code in the operator?
> >>
> >> Michael
> >>
> >>> On Apr 7, 2018, at 10:42 AM, Ken Krugler 
> wrote:
> >>>
> >>> When you say “join” are you talking about a real join (so one or more
> fields can be used as a joining key), or some other operation?
> >>>
> >>> For more than two streams, you can do cascading window joins via
> multiple join()s that reduce your source streams down to a single stream.
> >>>
> >>> If the fields are the same across these streams, then a union()
> followed by say a ProcessFunction that implements your joining logic could
> work.
> >>>
> >>> Or you can convert all the streams to a common tuple format that
> consists of a union of the fields, so you can do a union() and then follow
> that with whatever logic is needed to actually do the join.
> >>>
> >>> Though I’m sure there are more elegant approaches :)
> >>>
> >>> — Ken
> >>>
> >>>
> >>>
>  On Apr 6, 2018, at 5:04 PM, Michael Latta 
> wrote:
> 
>  I would like to “join” several streams (>3) in a custom operator. Is
> this feasible in Flink?
> 
> 
>  Michael
> >>>
> >>> 
> >>> http://about.me/kkrugler
> >>> +1 530-210-6378
> >>>
> >>
> >
> > 
> > http://about.me/kkrugler
> > +1 530-210-6378
> >
>
>


[jira] [Created] (FLINK-9163) Harden e2e tests' signal traps and config restoration during abort

2018-04-12 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-9163:
--

 Summary: Harden e2e tests' signal traps and config restoration 
during abort
 Key: FLINK-9163
 URL: https://issues.apache.org/jira/browse/FLINK-9163
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.5.0, 1.6.0, 1.5.1
Reporter: Nico Kruber
Assignee: Nico Kruber
 Fix For: 1.5.0


Signal traps on certain systems, e.g. Linux, may be called concurrently when
the trap is caught during its own execution. In that case, our cleanup may just
go wrong and overly eagerly delete {{flink-conf.yaml}} as well.





[jira] [Created] (FLINK-9162) Scala REPL hanging when running example

2018-04-12 Thread JIRA
陈梓立 created FLINK-9162:
--

 Summary: Scala REPL hanging when running example
 Key: FLINK-9162
 URL: https://issues.apache.org/jira/browse/FLINK-9162
 Project: Flink
  Issue Type: Bug
  Components: Scala Shell
Affects Versions: 1.5.1
 Environment: {code:java}
➜  build-target git:(master) scala -version
Scala code runner version 2.12.4 -- Copyright 2002-2017, LAMP/EPFL and 
Lightbend, Inc.
➜  build-target git:(master) java -version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
{code}

Built from the latest GitHub SNAPSHOT.
Reporter: 陈梓立


{code:java}
➜  build-target git:(master) bin/start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host localhost.
Starting taskexecutor daemon on host localhost.
➜  build-target git:(master) bin/start-scala-shell.sh local
Starting Flink Shell:
log4j:WARN No appenders could be found for logger 
(org.apache.flink.configuration.GlobalConfiguration).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See [http://logging.apache.org/log4j/1.2/faq.html#noconfig] for more 
info.

Starting local Flink cluster (host: localhost, port: 49592). 

Connecting to Flink cluster (host: localhost, port: 49592).

...

scala> val dataStream = senv.fromElements(1, 2, 3, 4)
dataStream: org.apache.flink.streaming.api.scala.DataStream[Int] = 
org.apache.flink.streaming.api.scala.DataStream@6b576ff8

scala> dataStream.countWindowAll(2).sum(0).print()
res0: org.apache.flink.streaming.api.datastream.DataStreamSink[Int] = 
org.apache.flink.streaming.api.datastream.DataStreamSink@304e1e4e

scala> val text = benv.fromElements(
     |   "To be, or not to be,--that is the question:--",
     |   "Whether 'tis nobler in the mind to suffer",
     |   "The slings and arrows of outrageous fortune",
     |   "Or to take arms against a sea of troubles,")
text: org.apache.flink.api.scala.DataSet[String] = 
org.apache.flink.api.scala.DataSet@1237aa73

scala> val counts = text.flatMap { _.toLowerCase.split("\\W+") }.map { (_, 1) }.groupBy(0).sum(1)
counts: org.apache.flink.api.scala.AggregateDataSet[(String, Int)] = 
org.apache.flink.api.scala.AggregateDataSet@7dbf92aa
 
scala> counts.print()
{code}





[jira] [Created] (FLINK-9161) Support AS STRUCT syntax to create named STRUCT in SQL

2018-04-12 Thread Shuyi Chen (JIRA)
Shuyi Chen created FLINK-9161:
-

 Summary: Support AS STRUCT syntax to create named STRUCT in SQL
 Key: FLINK-9161
 URL: https://issues.apache.org/jira/browse/FLINK-9161
 Project: Flink
  Issue Type: New Feature
  Components: Table API & SQL
Reporter: Shuyi Chen
Assignee: Shuyi Chen


As discussed in the [calcite dev mailing
list|https://mail-archives.apache.org/mod_mbox/calcite-dev/201804.mbox/%3cCAMZk55avGNmp1vXeJwA1B_a8bGyCQ9ahxmE=R=6fklpf7jt...@mail.gmail.com%3e],
we want to add support for named structure construction in SQL, e.g.,

{code:java}
SELECT STRUCT(a as first_name, b as last_name,
              STRUCT(c as zip_code, d as street, e as state) as address) as record
FROM example_table
{code}

This would require the necessary changes in Calcite first.


