Re: Flink Standalone cluster - logging problem

2019-02-10 Thread simpleusr
Hi Chesnay,

below is the content of my log4j-cli.properties file. I expect my job logs
(packaged under com.mycompany.xyz) to be written to the file2 appender. However,
no file with the XYZ prefix is generated. I restarted the cluster and canceled and
resubmitted the job several times, but none of that helped.


log4j.rootLogger=INFO, file

# Log all infos in the given file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.file=${log.file}
log4j.appender.file.append=false
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n

log4j.appender.file2=org.apache.log4j.FileAppender
log4j.appender.file2.file=XYZ-${log.file}
log4j.appender.file2.append=false
log4j.appender.file2.layout=org.apache.log4j.PatternLayout
log4j.appender.file2.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n


# Log output from org.apache.flink.yarn to the console. This is used by the
# CliFrontend class when using a per-job YARN cluster.
log4j.logger.org.apache.flink.yarn=INFO, console
log4j.logger.org.apache.flink.yarn.cli.FlinkYarnSessionCli=INFO, console
log4j.logger.org.apache.hadoop=INFO, console

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n

# suppress the warning that hadoop native libraries are not loaded (irrelevant for the client)
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=OFF

# suppress the irrelevant (wrong) warnings from the netty channel handler
log4j.logger.org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline=ERROR, file


log4j.logger.com.hazelcast=INFO, file2
log4j.logger.com.mycompany.xyz=DEBUG, file2



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Couldn't figure out - How to do this in Flink? - Pls assist with suggestions

2019-02-10 Thread Titus Rakkesh
Thanks Chesnay. I will try that and let you know.

Thanks.

On Sun, Feb 10, 2019 at 2:31 PM Chesnay Schepler  wrote:

> You should be able to use a KeyedProcessFunction for that.
> Find matching elements via keyBy() on the first field.
> Aggregate into ValueState, send alert if necessary.
> Upon encountering a new key, setup a timer to remove the entry in 24h.
>
> On 08.02.2019 07:43, Titus Rakkesh wrote:
>
> Dears,
>
> I have a data stream continuously coming,
>
> DataStream> splitZTuple;
>
> Eg  - (775168263,113182,0.0)
>
> I have to store this for 24 hrs expiry in somewhere (Window or somewhere)
> to check against another stream.
>
> The second stream is
>
> DataStream> splittedVomsTuple which also
> continuously receiving one.
>
> Eg. (775168263,100.0)
>
>
> We need to accumulate the third element in (775168263,113182,*0.0*) in
> the WINDOW (If the corresponding first element match happened with the
> incoming second streams second element 775168263,*100.0*)
>
> While keeping this WINDOW session if any (775168263,113182,*175*) third
> element in the Window Stream exceed a value (Eg >150) we need to call back
> a another rest point to send an alert --- (775168263,113182,*175*)
> match the criteria. Simply a CEP call back.
>
>
> In Flink how we can do this kind of operations? Or do I need to think
> about any other framework? Please advise.
>
> Thanks...
>
>
>


Re: Flink Standalone cluster - dumps

2019-02-10 Thread simpleusr
Hi Chesnay,

Many thanks..



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Flink Standalone cluster - production settings

2019-02-10 Thread simpleusr
I know this seems a silly question, but I am trying to figure out the optimal setup
for our Flink jobs.
We are using a standalone cluster with 5 jobs. Each job has 3 async operators
with Executors with thread counts of 20, 20 and 100. The source is Kafka, and
Cassandra and REST sinks exist.
Currently we are using parallelism = 1. So, at max load, a single job spawns
at least 140 threads. Also, we are using netty-based libraries for Cassandra
and the REST calls. (As I can see in the thread dump, Flink also uses a netty server.)

What we see is that the total thread count adds up to ~500 for a single job.

The issue we faced is that, all of a sudden, all jobs began to fail in production,
and we saw that it was mainly due to the ulimit on user processes. All jobs had
started on one server in the cluster (I do not know why, as it is a cluster
with 3 members).

It was set to around 1500 on that server. We then set a higher value and the
problems seem to have gone away.

Can you recommend an optimal prod setting for a standalone cluster? Or should
there be a max limit on the threads spawned by a single job?

Regards



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Flink 1.7 Notebook Environment

2019-02-10 Thread Jeff Zhang
Hi Faizan,

I have implemented a Flink interpreter for Blink, which was recently donated by
Alibaba to the Flink community. Maybe you have noticed that news.

Here are some tutorials which you may be interested in:

https://flink-china.org/doc/blink/ops/zeppelin.html
https://flink-china.org/doc/blink/quickstart/zeppelin_quickstart.html

And here's the code base: https://github.com/zjffdu/zeppelin/tree/blink_poc


Faizan Ahmed  于2019年2月11日周一 上午11:44写道:

> Hi all,
> I have been searching around quite a bit and doing my own experiments to
> make the latest Flink release, 1.7.1, work with Apache Zeppelin; however,
> Apache Zeppelin's Flink interpreter is quite outdated (Flink 1.3). AFAIK
> it is not possible to use Flink running on YARN via Zeppelin, as it only works
> with a local cluster.
>
> Has anyone been able to run Flink's latest release on Zeppelin? If yes,
> then please share some instructions/tutorial. If not, then is there any other
> suitable notebook environment for running Flink (maybe Jupyter)? I want to
> prototype my ideas in Flink, and since I'm coming from a Spark background it
> would be really useful to have a notebook environment for validation of Flink
> apps.
>
> Looking forward to your response
>
> Thanks
>


-- 
Best Regards

Jeff Zhang


Flink 1.7 Notebook Environment

2019-02-10 Thread Faizan Ahmed
Hi all,
I have been searching around quite a bit and doing my own experiments to
make the latest Flink release, 1.7.1, work with Apache Zeppelin; however,
Apache Zeppelin's Flink interpreter is quite outdated (Flink 1.3). AFAIK
it is not possible to use Flink running on YARN via Zeppelin, as it only works
with a local cluster.

Has anyone been able to run Flink's latest release on Zeppelin? If yes, then
please share some instructions/tutorial. If not, then is there any other
suitable notebook environment for running Flink (maybe Jupyter)? I want to
prototype my ideas in Flink, and since I'm coming from a Spark background it
would be really useful to have a notebook environment for validation of Flink
apps.

Looking forward to your response

Thanks


Re: Help with a stream processing use case

2019-02-10 Thread Tzu-Li (Gordon) Tai
Hi,

If Firehose already supports sinking records from a Kinesis stream to an
S3 bucket, then yes, Chesnay's suggestion would work.
You route each record to a specific Kinesis stream depending on some value
in the record using the KinesisSerializationSchema, and sink each Kinesis
stream to its target S3 bucket.

Another obvious approach is to use side output tags in the Flink job to
route records to different streaming file sinks that write to their own S3
buckets, but that would require knowing the target S3 buckets upfront.
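
A minimal sketch of such a KinesisSerializationSchema, assuming records arrive as
Tuple2<tenant, payload> (the record type and the stream naming are made up; adapt
them to your events):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.connectors.kinesis.serialization.KinesisSerializationSchema;

// Routes each record to a Kinesis stream chosen from a value inside the record.
public class TenantRoutingSchema implements KinesisSerializationSchema<Tuple2<String, String>> {

    @Override
    public ByteBuffer serialize(Tuple2<String, String> element) {
        // f1 is the payload to write
        return ByteBuffer.wrap(element.f1.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public String getTargetStream(Tuple2<String, String> element) {
        // one stream per tenant (f0); each stream can then be hooked to its own Firehose -> S3 bucket
        return "events-" + element.f0;
    }
}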

Cheers,
Gordon

On Sun, Feb 10, 2019 at 5:42 PM Chesnay Schepler  wrote:

> I'll need someone else to chime in here for a definitive answer (cc'd
> Gordon), so I'm really just guessing here.
>
> For the partitioning it looks like you can use a custom partitioner, see
> FlinkKinesisProducer#setCustomPartitioner.
> Have you looked at the KinesisSerializationSchema described in the
> documentation
> ?
> It allows you to write to a specific stream based on incoming events, but
> I'm not sure whether this translates to S3 buckets and keyspaces.
>
> On 08.02.2019 16:43, Sandybayev, Turar (CAI - Atlanta) wrote:
>
> Hi all,
>
>
>
> I wonder whether it’s possible to use Flink for the following requirement.
> We need to process a Kinesis stream and based on values in each record,
> route those records to different S3 buckets and keyspaces, with support for
> batching up of files and control over partitioning scheme (so preferably
> through Firehose).
>
>
>
> I know it’s straightforward to have a Kinesis source and a Kinesis sink,
> and then hook up Firehose to the sink from AWS, but I need a “fan out” to
> potentially thousands of different buckets, based on the content of each event.
>
>
>
> Thanks!
>
> Turar
>
>
>
>
>
>
>


Re: flink 1.7.1 and RollingFileSink

2019-02-10 Thread Vishal Santoshi
Any one ?

On Sun, Feb 10, 2019 at 2:07 PM Vishal Santoshi 
wrote:

> You don't have to. Thank you for the input.
>
> On Sun, Feb 10, 2019 at 1:56 PM Timothy Victor  wrote:
>
>> My apologies for not seeing your use case properly.   The constraint on
>> rolling policy is only applicable for bulk formats such as Parquet as
>> highlighted in the docs.
>>
>> As for your questions, I'll have to defer to others more familiar with
>> it.   I mostly just use bulk formats such as avro and parquet.
>>
>> Tim
>>
>>
>> On Sun, Feb 10, 2019, 12:40 PM Vishal Santoshi > wrote:
>>
>>> That said the in the DefaultRollingPolicy it seems the check is on the
>>> file size ( mimics the check shouldRollOnEVent()).
>>>
>>> I guess the question is
>>>
>>> Is  the call to shouldRollOnCheckPoint.  done by the checkpointing
>>> thread ?
>>>
>>> Are the calls to the other 2 methods shouldRollOnEVent and
>>> shouldRollOnProcessingTIme done on the execution thread  as in inlined ?
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Feb 10, 2019 at 1:17 PM Vishal Santoshi <
>>> vishal.santo...@gmail.com> wrote:
>>>
 Thanks for the quick reply.

 I am confused. If this was a more full featured BucketingSink ,I would
 imagine that  based on shouldRollOnEvent and shouldRollOnEvent, an in
 progress file could go into pending phase and on checkpoint the pending
 part file would be  finalized. For exactly once any files ( in progress
 file ) will have a length of the file  snapshotted to the checkpoint  and
 used to truncate the file ( if supported ) or dropped as a part-length file
 ( if truncate not supported )  if a resume from a checkpoint was to happen,
 to indicate what part of the the finalized file ( finalized when resumed )
 was valid . and  I had always assumed ( and there is no doc otherwise )
 that shouldRollOnCheckpoint would be similar to the other 2 apart from
 the fact it does the roll and finalize step in a single step on a
 checkpoint.


 Am I better off using BucketingSink ?  When to use BucketingSink and
 when to use RollingSink is not clear at all, even though at the surface it
 sure looks RollingSink is a better version of .BucketingSink ( or not )

 Regards.



 On Sun, Feb 10, 2019 at 12:09 PM Timothy Victor 
 wrote:

> I think the only rolling policy that can be used is
> CheckpointRollingPolicy to ensure exactly once.
>
> Tim
>
> On Sun, Feb 10, 2019, 9:13 AM Vishal Santoshi <
> vishal.santo...@gmail.com wrote:
>
>> Can StreamingFileSink be used instead of 
>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/connectors/filesystem_sink.html,
>>  even though it looks it could.
>>
>>
>> This code for example
>>
>>
>> StreamingFileSink
>> .forRowFormat(new Path(PATH),
>> new SimpleStringEncoder())
>> .withBucketAssigner(new 
>> KafkaRecordBucketAssigner(DEFAULT_FORMAT_STRING, ZONE_ID))
>> .withRollingPolicy(new RollingPolicy> String>() {
>>@Override
>>public boolean 
>> shouldRollOnCheckpoint(PartFileInfo partFileState) throws 
>> IOException {
>>return false;
>>}
>>
>>@Override
>>public boolean 
>> shouldRollOnEvent(PartFileInfo partFileState,
>> 
>> KafkaRecord element) throws IOException {
>>return 
>> partFileState.getSize() > 1024 * 1024 * 1024l;
>>}
>>
>>@Override
>>public boolean 
>> shouldRollOnProcessingTime(PartFileInfo partFileState, long 
>> currentTime) throws IOException {
>>return currentTime - 
>> partFileState.getLastUpdateTime() > 10 * 60 * 1000l ||
>>currentTime - 
>> partFileState.getCreationTime() > 120 * 60 * 1000l;
>>}
>>}
>> )
>> .build();
>>
>>
>> few things I see and am not sure I follow about the new RollingFileSink  
>> vis a vis BucketingSink
>>
>>
>> 1. I do not ever see the inprogress file go to the pending state, as in 
>> renamed as pending, as was the case in Bucketing Sink.  I would assume 
>> that it would be pending and then
>>
>>finalized on checkpoint for exactly once semantics ?

Re: Is there a windowing strategy that allows a different offset per key?

2019-02-10 Thread Stephen Connolly
Looking into the code in TumblingEventTimeWindows:

@Override
public Collection<TimeWindow> assignWindows(Object element, long timestamp, WindowAssignerContext context) {
    if (timestamp > Long.MIN_VALUE) {
        // Long.MIN_VALUE is currently assigned when no timestamp is present
        long start = TimeWindow.getWindowStartWithOffset(timestamp, offset, size);
        return Collections.singletonList(new TimeWindow(start, start + size));
    } else {
        throw new RuntimeException("Record has Long.MIN_VALUE timestamp (= no timestamp marker). " +
            "Is the time characteristic set to 'ProcessingTime', or did you forget to call " +
            "'DataStream.assignTimestampsAndWatermarks(...)'?");
    }
}

So I think I can just write my own where the offset is derived from hashing
the element using my hash function.

Good plan or bad plan?
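
For reference, a rough sketch of that idea: a tumbling event-time assigner whose
offset is derived from the element, so each key's windows close at a different
point in time. keyOf() below is a placeholder for however you extract the key
(e.g. your own hash of the customer id); everything else follows the snippet above.

import java.util.Collection;
import java.util.Collections;

import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class PerKeyOffsetTumblingWindows extends TumblingEventTimeWindows {

    private final long size;

    public PerKeyOffsetTumblingWindows(long sizeMillis) {
        super(sizeMillis, 0);
        this.size = sizeMillis;
    }

    @Override
    public Collection<TimeWindow> assignWindows(Object element, long timestamp, WindowAssignerContext context) {
        // spread window starts deterministically over [0, size) based on the element's key
        long offset = Math.floorMod((long) keyOf(element).hashCode(), size);
        long start = TimeWindow.getWindowStartWithOffset(timestamp, offset, size);
        return Collections.singletonList(new TimeWindow(start, start + size));
    }

    private Object keyOf(Object element) {
        // placeholder: return the same key you use in keyBy(), e.g. ((Event) element).getCustomerId()
        return element;
    }
}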


On Sun, 10 Feb 2019 at 19:55, Stephen Connolly <
stephen.alan.conno...@gmail.com> wrote:

> I would like to process a stream of data from different customers,
> producing output say once every 15 minutes. The results will then be loaded
> into another system for storage and querying.
>
> I have been using TumblingEventTimeWindows in my prototype, but I am
> concerned that all the windows will start and stop at the same time and
> cause batch load effects on the back-end data store.
>
> What I think I would like is that the windows could have a different start
> offset for each key, (using a hash function that I would supply)
>
> Thus, deterministically, key "ca:fe:ba:be" would always start based on an
> initial offset of 00:07 UTC while say key "de:ad:be:ef" would always start
> based on an initial offset of say 00:02 UTC
>
> Is this possible? Or do I just have to find some way of queuing up my
> writes using back-pressure?
>
> Thanks in advance
>
> -stephenc
>
> P.S. I can trade assistance with Flink for assistance with Maven or
> Jenkins if my questions are too wearisome!
>


Is there a windowing strategy that allows a different offset per key?

2019-02-10 Thread Stephen Connolly
I would like to process a stream of data from different customers,
producing output say once every 15 minutes. The results will then be loaded
into another system for storage and querying.

I have been using TumblingEventTimeWindows in my prototype, but I am
concerned that all the windows will start and stop at the same time and
cause batch load effects on the back-end data store.

What I think I would like is that the windows could have a different start
offset for each key, (using a hash function that I would supply)

Thus, deterministically, key "ca:fe:ba:be" would always start based on an
initial offset of 00:07 UTC while say key "de:ad:be:ef" would always start
based on an initial offset of say 00:02 UTC

Is this possible? Or do I just have to find some way of queuing up my
writes using back-pressure?

Thanks in advance

-stephenc

P.S. I can trade assistance with Flink for assistance with Maven or Jenkins
if my questions are too wearisome!


Re: Reduce one event under multiple keys

2019-02-10 Thread Stephen Connolly
On Sun, 10 Feb 2019 at 09:09, Chesnay Schepler  wrote:

> This sounds reasonable to me.
>
> I'm a bit confused by this question: "*Additionally, I am (naïevely)
> hoping that if a window has no events for a particular key, the
> memory/storage costs are zero for that key.*"
>
> Are you asking whether a key that was received in window X (as part of an
> event) is still present in window x+1? If so, then the answer is no; a key
> will only be present in a given window if an event was received that fits
> into that window.
>

To confirm:

So let's say I'm tracking the average time a file is open, per folder.

In window N we get the events:

{"source":"ca:fe:ba:be","action":"open","path":"/foo/bar/README.txt"}
{"source":"ca:fe:ba:be","action":"close","path":"/foo/bar/README.txt","duration":"67"}
{"source":"ca:fe:ba:be","action":"open","path":"/foo/bar/User guide.txt"}
{"source":"ca:fe:ba:be","action":"open","path":"/foo/bar/Admin guide.txt"}

So there will be aggregates stored for
("ca:fe:ba:be","/"), ("ca:fe:ba:be","/foo"), ("ca:fe:ba:be","/foo/bar"),
("ca:fe:ba:be","/foo/bar/README.txt"), etc

In window N+1 we do not get any events at all.

So the memory used by my aggregation functions from window N will be freed,
and the storage will be effectively zero (modulo any follow-on processing
that might be on a longer window).

This seems to be what you are saying... in which case my naïve hope was
not so naïve! w00t!


>
> On 08.02.2019 13:21, Stephen Connolly wrote:
>
> Ok, I'll try and map my problem into something that should be familiar to
> most people.
>
> Consider collection of PCs, each of which has a unique ID, e.g.
> ca:fe:ba:be, de:ad:be:ef, etc.
>
> Each PC has a tree of local files. Some of the file paths are
> coincidentally the same names, but there is no file sharing between PCs.
>
> I need to produce metrics about how often files are opened and how long
> they are open for.
>
> I need for every X minute tumbling window not just the cumulative averages
> for each PC, but the averages for each file as well as the cumulative
> averages for each folder and their sub-folders.
>
> I have a stream of events like
>
>
> {"source":"ca:fe:ba:be","action":"open","path":"/foo/bar/README.txt"}{"source":"ca:fe:ba:be","action":"close","path":"/foo/bar/README.txt","duration":"67"}{"source":"de:ad:be:ef","action":"open","path":"/foo/manchu/README.txt"}
> {"source":"ca:fe:ba:be","action":"open","path":"/foo/bar/User
> guide.txt"}{"source":"ca:fe:ba:be","action":"open","path":"/foo/bar/Admin
> guide.txt"}{"source":"ca:fe:ba:be","action":"close","path":"/foo/bar/User
> guide.txt","duration":"97"}{"source":"ca:fe:ba:be","action":"close","path":"/foo/bar/Admin
> guide.txt","duration":"196"}
> {"source":"ca:fe:ba:be","action":"open","path":"/foo/manchu/README.txt"}
> {"source":"de:ad:be:ef","action":"open","path":"/bar/foo/README.txt"}
>
> So from that I would like to know stuff like:
>
> ca:fe:ba:be had 4/X opens per minute in the X minute window
> ca:fe:ba:be had 3/X closes per minute in the X minute window and the
> average time open was (67+97+196)/3=120... there is no guarantee that the
> closes will be matched with opens in the same window, which is why I'm only
> tracking them separately
> de:ad:be:ef had 2/X opens per minute in the X minute window
> ca:fe:ba:be /foo had 4/X opens per minute in the X minute window
> ca:fe:ba:be /foo had 3/X closes per minute in the X minute window and the
> average time open was 120
> de:ad:be:ef /foo had 1/X opens per minute in the X minute window
> de:ad:be:ef /bar had 1/X opens per minute in the X minute window
> de:ad:be:ef /foo/manchu had 1/X opens per minute in the X minute window
> de:ad:be:ef /bar/foo had 1/X opens per minute in the X minute window
> de:ad:be:ef /foo/manchu/README.txt had 1/X opens per minute in the X
> minute window
> de:ad:be:ef /bar/foo/README.txt had 1/X opens per minute in the X minute
> window
> etc
>
> What I think I want to do is turn each event into a series of events with
> different keys, so that
>
> {"source":"ca:fe:ba:be","action":"open","path":"/foo/bar/README.txt"}
>
> gets sent under the keys:
>
> ("ca:fe:ba:be","/")
> ("ca:fe:ba:be","/foo")
> ("ca:fe:ba:be","/foo/bar")
> ("ca:fe:ba:be","/foo/bar/README.txt")
>
> Then I could use a window aggregation function to just:
>
> * count the "open" events
> * count the "close" events and sum their duration
>
> Additionally, I am (naïevely) hoping that if a window has no events for a
> particular key, the memory/storage costs are zero for that key.
>
> From what I can see, to achieve what I am trying to do, I could use a
> flatMap followed by a keyBy
>
> In other words I take the events and flat map them based on the path split
> on '/' returning a Tuple of the (to be) key and the event. Then I can use
> keyBy to key based on the Tuple 0.
>
> My ask:
>
> Is the above design a good design? How would you achieve the end game
> better? Do I need to worry about many paths that are accessed rarely 

Re: Can an Aggregate the key from a WindowedStream.aggregate()

2019-02-10 Thread Stephen Connolly
On Sun, 10 Feb 2019 at 09:48, Chesnay Schepler  wrote:

> There are also versions of WindowedStream#aggregate that accept an
> additional WindowFunction/ProcessWindowFunction, which do have access to
> the key via apply()/process() respectively. These functions are called
> post aggregation.
>

Cool I'll chase those down


>
> On 08.02.2019 18:24, stephen.alan.conno...@gmail.com wrote:
> > If I write my aggregation logic as a WindowFunction then I get access to
> the key as the first parameter in WindowFunction.apply(...) however the
> Javadocs for calling WindowedStream.apply(WindowFunction) state:
> >
> >> Note that this function requires that all data in the windows is
> buffered until the window
> >> is evaluated, as the function provides no means of incremental
> aggregation.
> > Which sounds bad.
> >
> > It seems the recommended alternative is to use one of the
> WindowFunction.aggregate(AggregateFunction) however I cannot see how to get
> access to the key...
> >
> > Is my only solution to transform my data into a Tuple if I need access
> to the key post aggregation?
> >
> > Thanks in advance
> >
> > -stephenc
> >
>
>


Re: flink 1.7.1 and RollingFileSink

2019-02-10 Thread Vishal Santoshi
You don't have to. Thank you for the input.

On Sun, Feb 10, 2019 at 1:56 PM Timothy Victor  wrote:

> My apologies for not seeing your use case properly.   The constraint on
> rolling policy is only applicable for bulk formats such as Parquet as
> highlighted in the docs.
>
> As for your questions, I'll have to defer to others more familiar with
> it.   I mostly just use bulk formats such as avro and parquet.
>
> Tim
>
>
> On Sun, Feb 10, 2019, 12:40 PM Vishal Santoshi  wrote:
>
>> That said the in the DefaultRollingPolicy it seems the check is on the
>> file size ( mimics the check shouldRollOnEVent()).
>>
>> I guess the question is
>>
>> Is  the call to shouldRollOnCheckPoint.  done by the checkpointing thread
>> ?
>>
>> Are the calls to the other 2 methods shouldRollOnEVent and
>> shouldRollOnProcessingTIme done on the execution thread  as in inlined ?
>>
>>
>>
>>
>>
>> On Sun, Feb 10, 2019 at 1:17 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> Thanks for the quick reply.
>>>
>>> I am confused. If this was a more full featured BucketingSink ,I would
>>> imagine that  based on shouldRollOnEvent and shouldRollOnEvent, an in
>>> progress file could go into pending phase and on checkpoint the pending
>>> part file would be  finalized. For exactly once any files ( in progress
>>> file ) will have a length of the file  snapshotted to the checkpoint  and
>>> used to truncate the file ( if supported ) or dropped as a part-length file
>>> ( if truncate not supported )  if a resume from a checkpoint was to happen,
>>> to indicate what part of the the finalized file ( finalized when resumed )
>>> was valid . and  I had always assumed ( and there is no doc otherwise )
>>> that shouldRollOnCheckpoint would be similar to the other 2 apart from
>>> the fact it does the roll and finalize step in a single step on a
>>> checkpoint.
>>>
>>>
>>> Am I better off using BucketingSink ?  When to use BucketingSink and
>>> when to use RollingSink is not clear at all, even though at the surface it
>>> sure looks RollingSink is a better version of .BucketingSink ( or not )
>>>
>>> Regards.
>>>
>>>
>>>
>>> On Sun, Feb 10, 2019 at 12:09 PM Timothy Victor 
>>> wrote:
>>>
 I think the only rolling policy that can be used is
 CheckpointRollingPolicy to ensure exactly once.

 Tim

 On Sun, Feb 10, 2019, 9:13 AM Vishal Santoshi <
 vishal.santo...@gmail.com wrote:

> Can StreamingFileSink be used instead of 
> https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/connectors/filesystem_sink.html,
>  even though it looks it could.
>
>
> This code for example
>
>
> StreamingFileSink
> .forRowFormat(new Path(PATH),
> new SimpleStringEncoder())
> .withBucketAssigner(new 
> KafkaRecordBucketAssigner(DEFAULT_FORMAT_STRING, ZONE_ID))
> .withRollingPolicy(new RollingPolicy String>() {
>@Override
>public boolean 
> shouldRollOnCheckpoint(PartFileInfo partFileState) throws 
> IOException {
>return false;
>}
>
>@Override
>public boolean 
> shouldRollOnEvent(PartFileInfo partFileState,
> 
> KafkaRecord element) throws IOException {
>return partFileState.getSize() 
> > 1024 * 1024 * 1024l;
>}
>
>@Override
>public boolean 
> shouldRollOnProcessingTime(PartFileInfo partFileState, long 
> currentTime) throws IOException {
>return currentTime - 
> partFileState.getLastUpdateTime() > 10 * 60 * 1000l ||
>currentTime - 
> partFileState.getCreationTime() > 120 * 60 * 1000l;
>}
>}
> )
> .build();
>
>
> few things I see and am not sure I follow about the new RollingFileSink  
> vis a vis BucketingSink
>
>
> 1. I do not ever see the inprogress file go to the pending state, as in 
> renamed as pending, as was the case in Bucketing Sink.  I would assume 
> that it would be pending and then
>
>finalized on checkpoint for exactly once semantics ?
>
>
> 2. I see dangling inprogress files at the end of the day. I would assume 
> that the withBucketCheckInterval set to 1 minute by default, the 
> shouldRollOnProcessingTime 

Re: flink 1.7.1 and RollingFileSink

2019-02-10 Thread Timothy Victor
My apologies for not seeing your use case properly.   The constraint on
rolling policy is only applicable for bulk formats such as Parquet as
highlighted in the docs.

As for your questions, I'll have to defer to others more familiar with it.
 I mostly just use bulk formats such as avro and parquet.

Tim
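
For row formats you are not limited to rolling on checkpoints; the built-in
DefaultRollingPolicy already combines the size / inactivity / age checks that the
custom policy quoted below implements by hand. A rough sketch, assuming the 1.7.x
builder API (DefaultRollingPolicy.create()); the path and intervals are placeholders:

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

public class RowFormatSinkSketch {

    public static StreamingFileSink<String> buildSink() {
        return StreamingFileSink
                .forRowFormat(new Path("/tmp/output"), new SimpleStringEncoder<String>("UTF-8"))
                .withRollingPolicy(DefaultRollingPolicy.create()
                        .withMaxPartSize(1024 * 1024 * 1024)                   // roll at ~1 GB
                        .withRolloverInterval(TimeUnit.MINUTES.toMillis(120))  // or when a part is 2 h old
                        .withInactivityInterval(TimeUnit.MINUTES.toMillis(10)) // or after 10 min without writes
                        .build())
                .build();
    }
}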


On Sun, Feb 10, 2019, 12:40 PM Vishal Santoshi wrote:

> That said the in the DefaultRollingPolicy it seems the check is on the
> file size ( mimics the check shouldRollOnEVent()).
>
> I guess the question is
>
> Is  the call to shouldRollOnCheckPoint.  done by the checkpointing thread
> ?
>
> Are the calls to the other 2 methods shouldRollOnEVent and
> shouldRollOnProcessingTIme done on the execution thread  as in inlined ?
>
>
>
>
>
> On Sun, Feb 10, 2019 at 1:17 PM Vishal Santoshi 
> wrote:
>
>> Thanks for the quick reply.
>>
>> I am confused. If this was a more full featured BucketingSink ,I would
>> imagine that  based on shouldRollOnEvent and shouldRollOnEvent, an in
>> progress file could go into pending phase and on checkpoint the pending
>> part file would be  finalized. For exactly once any files ( in progress
>> file ) will have a length of the file  snapshotted to the checkpoint  and
>> used to truncate the file ( if supported ) or dropped as a part-length file
>> ( if truncate not supported )  if a resume from a checkpoint was to happen,
>> to indicate what part of the the finalized file ( finalized when resumed )
>> was valid . and  I had always assumed ( and there is no doc otherwise )
>> that shouldRollOnCheckpoint would be similar to the other 2 apart from
>> the fact it does the roll and finalize step in a single step on a
>> checkpoint.
>>
>>
>> Am I better off using BucketingSink ?  When to use BucketingSink and when
>> to use RollingSink is not clear at all, even though at the surface it sure
>> looks RollingSink is a better version of .BucketingSink ( or not )
>>
>> Regards.
>>
>>
>>
>> On Sun, Feb 10, 2019 at 12:09 PM Timothy Victor 
>> wrote:
>>
>>> I think the only rolling policy that can be used is
>>> CheckpointRollingPolicy to ensure exactly once.
>>>
>>> Tim
>>>
>>> On Sun, Feb 10, 2019, 9:13 AM Vishal Santoshi >> wrote:
>>>
 Can StreamingFileSink be used instead of 
 https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/connectors/filesystem_sink.html,
  even though it looks it could.


 This code for example


 StreamingFileSink
 .forRowFormat(new Path(PATH),
 new SimpleStringEncoder())
 .withBucketAssigner(new 
 KafkaRecordBucketAssigner(DEFAULT_FORMAT_STRING, ZONE_ID))
 .withRollingPolicy(new RollingPolicy>>> String>() {
@Override
public boolean 
 shouldRollOnCheckpoint(PartFileInfo partFileState) throws 
 IOException {
return false;
}

@Override
public boolean 
 shouldRollOnEvent(PartFileInfo partFileState,
 
 KafkaRecord element) throws IOException {
return partFileState.getSize() 
 > 1024 * 1024 * 1024l;
}

@Override
public boolean 
 shouldRollOnProcessingTime(PartFileInfo partFileState, long 
 currentTime) throws IOException {
return currentTime - 
 partFileState.getLastUpdateTime() > 10 * 60 * 1000l ||
currentTime - 
 partFileState.getCreationTime() > 120 * 60 * 1000l;
}
}
 )
 .build();


 few things I see and am not sure I follow about the new RollingFileSink  
 vis a vis BucketingSink


 1. I do not ever see the inprogress file go to the pending state, as in 
 renamed as pending, as was the case in Bucketing Sink.  I would assume 
 that it would be pending and then

finalized on checkpoint for exactly once semantics ?


 2. I see dangling inprogress files at the end of the day. I would assume 
 that the withBucketCheckInterval set to 1 minute by default, the 
 shouldRollOnProcessingTime should kick in ?

  3. The inprogress files are  like 
 .part-4-45.inprogress.3ed08b67-2b56-4d31-a233-b1c146cfcf14 . What is that 
 additional suffix ?




 I have the following set up on the env

 env.enableCheckpointing(10 * 6);
 

Re: flink 1.7.1 and RollingFileSink

2019-02-10 Thread Vishal Santoshi
That said, in the DefaultRollingPolicy it seems the check is on the file
size (it mimics the shouldRollOnEvent() check).

I guess the question is:

Is the call to shouldRollOnCheckpoint() done by the checkpointing thread?

Are the calls to the other 2 methods, shouldRollOnEvent() and
shouldRollOnProcessingTime(), done on the execution thread, as in inlined?





On Sun, Feb 10, 2019 at 1:17 PM Vishal Santoshi 
wrote:

> Thanks for the quick reply.
>
> I am confused. If this was a more full featured BucketingSink ,I would
> imagine that  based on shouldRollOnEvent and shouldRollOnEvent, an in
> progress file could go into pending phase and on checkpoint the pending
> part file would be  finalized. For exactly once any files ( in progress
> file ) will have a length of the file  snapshotted to the checkpoint  and
> used to truncate the file ( if supported ) or dropped as a part-length file
> ( if truncate not supported )  if a resume from a checkpoint was to happen,
> to indicate what part of the the finalized file ( finalized when resumed )
> was valid . and  I had always assumed ( and there is no doc otherwise )
> that shouldRollOnCheckpoint would be similar to the other 2 apart from
> the fact it does the roll and finalize step in a single step on a
> checkpoint.
>
>
> Am I better off using BucketingSink ?  When to use BucketingSink and when
> to use RollingSink is not clear at all, even though at the surface it sure
> looks RollingSink is a better version of .BucketingSink ( or not )
>
> Regards.
>
>
>
> On Sun, Feb 10, 2019 at 12:09 PM Timothy Victor  wrote:
>
>> I think the only rolling policy that can be used is
>> CheckpointRollingPolicy to ensure exactly once.
>>
>> Tim
>>
>> On Sun, Feb 10, 2019, 9:13 AM Vishal Santoshi > wrote:
>>
>>> Can StreamingFileSink be used instead of 
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/connectors/filesystem_sink.html,
>>>  even though it looks it could.
>>>
>>>
>>> This code for example
>>>
>>>
>>> StreamingFileSink
>>> .forRowFormat(new Path(PATH),
>>> new SimpleStringEncoder())
>>> .withBucketAssigner(new 
>>> KafkaRecordBucketAssigner(DEFAULT_FORMAT_STRING, ZONE_ID))
>>> .withRollingPolicy(new RollingPolicy() 
>>> {
>>>@Override
>>>public boolean 
>>> shouldRollOnCheckpoint(PartFileInfo partFileState) throws 
>>> IOException {
>>>return false;
>>>}
>>>
>>>@Override
>>>public boolean 
>>> shouldRollOnEvent(PartFileInfo partFileState,
>>> 
>>> KafkaRecord element) throws IOException {
>>>return partFileState.getSize() > 
>>> 1024 * 1024 * 1024l;
>>>}
>>>
>>>@Override
>>>public boolean 
>>> shouldRollOnProcessingTime(PartFileInfo partFileState, long 
>>> currentTime) throws IOException {
>>>return currentTime - 
>>> partFileState.getLastUpdateTime() > 10 * 60 * 1000l ||
>>>currentTime - 
>>> partFileState.getCreationTime() > 120 * 60 * 1000l;
>>>}
>>>}
>>> )
>>> .build();
>>>
>>>
>>> few things I see and am not sure I follow about the new RollingFileSink  
>>> vis a vis BucketingSink
>>>
>>>
>>> 1. I do not ever see the inprogress file go to the pending state, as in 
>>> renamed as pending, as was the case in Bucketing Sink.  I would assume that 
>>> it would be pending and then
>>>
>>>finalized on checkpoint for exactly once semantics ?
>>>
>>>
>>> 2. I see dangling inprogress files at the end of the day. I would assume 
>>> that the withBucketCheckInterval set to 1 minute by default, the 
>>> shouldRollOnProcessingTime should kick in ?
>>>
>>>  3. The inprogress files are  like 
>>> .part-4-45.inprogress.3ed08b67-2b56-4d31-a233-b1c146cfcf14 . What is that 
>>> additional suffix ?
>>>
>>>
>>>
>>>
>>> I have the following set up on the env
>>>
>>> env.enableCheckpointing(10 * 6);
>>> env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
>>> env.setRestartStrategy(fixedDelayRestart(4, 
>>> org.apache.flink.api.common.time.Time.minutes(1)));
>>> StateBackend stateBackEnd = new MemoryStateBackend();
>>> env.setStateBackend(stateBackEnd);
>>>
>>>
>>> Regards.
>>>
>>>
>>>
>>>
>>>


Re: flink 1.7.1 and RollingFileSink

2019-02-10 Thread Vishal Santoshi
Thanks for the quick reply.

I am confused. If this were a more fully featured BucketingSink, I would
imagine that, based on shouldRollOnEvent and shouldRollOnProcessingTime, an
in-progress file could go into the pending phase, and on checkpoint the pending
part file would be finalized. For exactly-once, any file (an in-progress
file) will have its length snapshotted to the checkpoint and
used to truncate the file (if supported) or dropped as a part-length file
(if truncate is not supported) if a resume from a checkpoint were to happen,
to indicate what part of the finalized file (finalized when resumed)
was valid. And I had always assumed (and there is no doc saying otherwise)
that shouldRollOnCheckpoint would be similar to the other 2, apart from the
fact that it does the roll and finalize step in a single step on a checkpoint.


Am I better off using BucketingSink? When to use BucketingSink and when
to use RollingSink is not clear at all, even though at the surface it sure
looks like RollingSink is a better version of BucketingSink (or not).

Regards.



On Sun, Feb 10, 2019 at 12:09 PM Timothy Victor  wrote:

> I think the only rolling policy that can be used is
> CheckpointRollingPolicy to ensure exactly once.
>
> Tim
>
> On Sun, Feb 10, 2019, 9:13 AM Vishal Santoshi  wrote:
>
>> Can StreamingFileSink be used instead of 
>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/connectors/filesystem_sink.html,
>>  even though it looks it could.
>>
>>
>> This code for example
>>
>>
>> StreamingFileSink
>> .forRowFormat(new Path(PATH),
>> new SimpleStringEncoder())
>> .withBucketAssigner(new 
>> KafkaRecordBucketAssigner(DEFAULT_FORMAT_STRING, ZONE_ID))
>> .withRollingPolicy(new RollingPolicy() {
>>@Override
>>public boolean 
>> shouldRollOnCheckpoint(PartFileInfo partFileState) throws 
>> IOException {
>>return false;
>>}
>>
>>@Override
>>public boolean 
>> shouldRollOnEvent(PartFileInfo partFileState,
>> 
>> KafkaRecord element) throws IOException {
>>return partFileState.getSize() > 
>> 1024 * 1024 * 1024l;
>>}
>>
>>@Override
>>public boolean 
>> shouldRollOnProcessingTime(PartFileInfo partFileState, long 
>> currentTime) throws IOException {
>>return currentTime - 
>> partFileState.getLastUpdateTime() > 10 * 60 * 1000l ||
>>currentTime - 
>> partFileState.getCreationTime() > 120 * 60 * 1000l;
>>}
>>}
>> )
>> .build();
>>
>>
>> few things I see and am not sure I follow about the new RollingFileSink  vis 
>> a vis BucketingSink
>>
>>
>> 1. I do not ever see the inprogress file go to the pending state, as in 
>> renamed as pending, as was the case in Bucketing Sink.  I would assume that 
>> it would be pending and then
>>
>>finalized on checkpoint for exactly once semantics ?
>>
>>
>> 2. I see dangling inprogress files at the end of the day. I would assume 
>> that the withBucketCheckInterval set to 1 minute by default, the 
>> shouldRollOnProcessingTime should kick in ?
>>
>>  3. The inprogress files are  like 
>> .part-4-45.inprogress.3ed08b67-2b56-4d31-a233-b1c146cfcf14 . What is that 
>> additional suffix ?
>>
>>
>>
>>
>> I have the following set up on the env
>>
>> env.enableCheckpointing(10 * 6);
>> env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
>> env.setRestartStrategy(fixedDelayRestart(4, 
>> org.apache.flink.api.common.time.Time.minutes(1)));
>> StateBackend stateBackEnd = new MemoryStateBackend();
>> env.setStateBackend(stateBackEnd);
>>
>>
>> Regards.
>>
>>
>>
>>
>>


Re: flink 1.7.1 and RollingFileSink

2019-02-10 Thread Timothy Victor
I think the only rolling policy that can be used is CheckpointRollingPolicy
to ensure exactly once.

Tim

On Sun, Feb 10, 2019, 9:13 AM Vishal Santoshi wrote:

> Can StreamingFileSink be used instead of
> https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/connectors/filesystem_sink.html,
>  even though it looks it could.
>
>
> This code for example
>
>
> StreamingFileSink
> .forRowFormat(new Path(PATH),
> new SimpleStringEncoder())
> .withBucketAssigner(new 
> KafkaRecordBucketAssigner(DEFAULT_FORMAT_STRING, ZONE_ID))
> .withRollingPolicy(new RollingPolicy() {
>@Override
>public boolean 
> shouldRollOnCheckpoint(PartFileInfo partFileState) throws IOException 
> {
>return false;
>}
>
>@Override
>public boolean 
> shouldRollOnEvent(PartFileInfo partFileState,
> 
> KafkaRecord element) throws IOException {
>return partFileState.getSize() > 
> 1024 * 1024 * 1024l;
>}
>
>@Override
>public boolean 
> shouldRollOnProcessingTime(PartFileInfo partFileState, long 
> currentTime) throws IOException {
>return currentTime - 
> partFileState.getLastUpdateTime() > 10 * 60 * 1000l ||
>currentTime - 
> partFileState.getCreationTime() > 120 * 60 * 1000l;
>}
>}
> )
> .build();
>
>
> few things I see and am not sure I follow about the new RollingFileSink  vis 
> a vis BucketingSink
>
>
> 1. I do not ever see the inprogress file go to the pending state, as in 
> renamed as pending, as was the case in Bucketing Sink.  I would assume that 
> it would be pending and then
>
>finalized on checkpoint for exactly once semantics ?
>
>
> 2. I see dangling inprogress files at the end of the day. I would assume that 
> the withBucketCheckInterval set to 1 minute by default, the 
> shouldRollOnProcessingTime should kick in ?
>
>  3. The inprogress files are  like 
> .part-4-45.inprogress.3ed08b67-2b56-4d31-a233-b1c146cfcf14 . What is that 
> additional suffix ?
>
>
>
>
> I have the following set up on the env
>
> env.enableCheckpointing(10 * 6);
> env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
> env.setRestartStrategy(fixedDelayRestart(4, 
> org.apache.flink.api.common.time.Time.minutes(1)));
> StateBackend stateBackEnd = new MemoryStateBackend();
> env.setStateBackend(stateBackEnd);
>
>
> Regards.
>
>
>
>
>


flink 1.7.1 and RollingFileSink

2019-02-10 Thread Vishal Santoshi
Can StreamingFileSink be used instead of
https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/connectors/filesystem_sink.html,
even though it looks like it could?


This code for example


StreamingFileSink
        .forRowFormat(new Path(PATH), new SimpleStringEncoder<KafkaRecord>())
        .withBucketAssigner(new KafkaRecordBucketAssigner(DEFAULT_FORMAT_STRING, ZONE_ID))
        .withRollingPolicy(new RollingPolicy<KafkaRecord, String>() {
            @Override
            public boolean shouldRollOnCheckpoint(PartFileInfo<String> partFileState) throws IOException {
                return false;
            }

            @Override
            public boolean shouldRollOnEvent(PartFileInfo<String> partFileState,
                                             KafkaRecord element) throws IOException {
                return partFileState.getSize() > 1024 * 1024 * 1024L;
            }

            @Override
            public boolean shouldRollOnProcessingTime(PartFileInfo<String> partFileState,
                                                      long currentTime) throws IOException {
                return currentTime - partFileState.getLastUpdateTime() > 10 * 60 * 1000L ||
                        currentTime - partFileState.getCreationTime() > 120 * 60 * 1000L;
            }
        })
        .build();


A few things I see and am not sure I follow about the new
RollingFileSink vis-a-vis BucketingSink:


1. I do not ever see the in-progress file go to the pending state, as
in renamed as pending, as was the case in BucketingSink. I would
assume that it would be pending and then finalized on checkpoint for
exactly-once semantics?


2. I see dangling in-progress files at the end of the day. I would
assume that, with withBucketCheckInterval set to 1 minute by default,
shouldRollOnProcessingTime should kick in?

3. The in-progress files are named like
.part-4-45.inprogress.3ed08b67-2b56-4d31-a233-b1c146cfcf14. What is
that additional suffix?




I have the following set up on the env

env.enableCheckpointing(10 * 6);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.setRestartStrategy(fixedDelayRestart(4,
org.apache.flink.api.common.time.Time.minutes(1)));
StateBackend stateBackEnd = new MemoryStateBackend();
env.setStateBackend(stateBackEnd);


Regards.


Re: stream of large objects

2019-02-10 Thread Chesnay Schepler
The Broadcast State may be interesting to you.


On 08.02.2019 15:57, Aggarwal, Ajay wrote:


Yes, another KeyBy will be used. The “small size” messages will be 
strings of length 500 to 1000.


Is there a concept of “global” state in flink? Is it possible to keep 
these lists in global state and only pass the list reference (by 
name?) in the LargeMessage?


*From: *Chesnay Schepler 
*Date: *Friday, February 8, 2019 at 8:45 AM
*To: *"Aggarwal, Ajay" , 
"user@flink.apache.org" 

*Subject: *Re: stream of large objects

Whether a LargeMessage is serialized depends on how the job is structured.
For example, if you were to only apply map/filter functions after the 
aggregation it is likely they wouldn't be serialized.

If you were to apply another keyBy they will be serialized again.

When you say "small size" messages, what are we talking about here?

On 07.02.2019 20:37, Aggarwal, Ajay wrote:

In my use case my source stream contain small size messages, but
as part of flink processing I will be aggregating them into large
messages and further processing will happen on these large
messages. The structure of this large message will be something
like this:

   Class LargeMessage {

  String key

   List  messages; // this is where the aggregation of
smaller messages happen

   }

In some cases this list field of LargeMessage can get very large
(1000’s of messages). Is it ok to create an intermediate stream of
these LargeMessages? What should I be concerned about while
designing the flink job? Specifically with parallelism in mind. As
these LargeMessages flow from one flink subtask to another, do
they get serialized/deserialized ?

Thanks.





Re: Can an Aggregate the key from a WindowedStream.aggregate()

2019-02-10 Thread Chesnay Schepler
There are also versions of WindowedStream#aggregate that accept an 
additional WindowFunction/ProcessWindowFunction, which do have access to 
the key via apply()/process() respectively. These functions are called 
post aggregation.
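
A minimal sketch of that combination, assuming the input has already been keyed
by a String key and we simply count elements per key and window (the types here
are made up for illustration):

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Incremental part: only the running count is kept per key and window.
class CountPerKey implements AggregateFunction<Tuple2<String, Double>, Long, Long> {
    @Override public Long createAccumulator() { return 0L; }
    @Override public Long add(Tuple2<String, Double> value, Long acc) { return acc + 1; }
    @Override public Long getResult(Long acc) { return acc; }
    @Override public Long merge(Long a, Long b) { return a + b; }
}

// Runs once per key and window with the single pre-aggregated value, and can attach the key.
class AttachKey extends ProcessWindowFunction<Long, Tuple2<String, Long>, String, TimeWindow> {
    @Override
    public void process(String key, Context ctx, Iterable<Long> counts, Collector<Tuple2<String, Long>> out) {
        out.collect(Tuple2.of(key, counts.iterator().next())); // exactly one element: the aggregate result
    }
}

// usage: stream.keyBy(t -> t.f0).timeWindow(Time.minutes(15)).aggregate(new CountPerKey(), new AttachKey());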


On 08.02.2019 18:24, stephen.alan.conno...@gmail.com wrote:

If I write my aggregation logic as a WindowFunction then I get access to the 
key as the first parameter in WindowFunction.apply(...) however the Javadocs 
for calling WindowedStream.apply(WindowFunction) state:


Note that this function requires that all data in the windows is buffered until 
the window
is evaluated, as the function provides no means of incremental aggregation.

Which sounds bad.

It seems the recommended alternative is to use one of the 
WindowFunction.aggregate(AggregateFunction) however I cannot see how to get 
access to the key...

Is my only solution to transform my data into a Tuple if I need access to the 
key post aggregation?

Thanks in advance

-stephenc





Re: Help with a stream processing use case

2019-02-10 Thread Chesnay Schepler
I'll need someone else to chime in here for a definitive answer (cc'd 
Gordon), so I'm really just guessing here.


For the partitioning it looks like you can use a custom partitioner, see 
FlinkKinesisProducer#setCustomPartitioner.
Have you looked at the KinesisSerializationSchema described in the 
documentation 
? 
It allows you to write to a specific stream based on incoming events, 
but I'm not sure whether this translates to S3 buckets and keyspaces.


On 08.02.2019 16:43, Sandybayev, Turar (CAI - Atlanta) wrote:


Hi all,

I wonder whether it’s possible to use Flink for the following 
requirement. We need to process a Kinesis stream and based on values 
in each record, route those records to different S3 buckets and 
keyspaces, with support for batching up of files and control over 
partitioning scheme (so preferably through Firehose).


I know it’s straightforward to have a Kinesis source and a Kinesis
sink, and then hook up Firehose to the sink from AWS, but I need a “fan
out” to potentially thousands of different buckets, based on the content
of each event.


Thanks!

Turar





Re: Running single Flink job in a job cluster, problem starting JobManager

2019-02-10 Thread Chesnay Schepler
I'm afraid we haven't had much experience with Spring Boot Flink 
applications.


It is indeed strange that the job ends up using a StreamPlanEnvironment.
As a debugging step I would look into all calls to 
ExecutionEnvironment#initializeContextEnvironment().
This is how specific execution environments are injected into 
(Stream)ExecutionEnvironment#getEnvironment().


On 08.02.2019 15:17, Thomas Eckestad wrote:

Hi again,

when removing Spring Boot from the application it works.

I would really like to mix Spring Boot and Flink. It does work with 
Spring Boot when submitting jobs to a session cluster, as stated before.


/Thomas

*From:* Thomas Eckestad 
*Sent:* Friday, February 8, 2019 12:14 PM
*To:* user@flink.apache.org
*Subject:* Running single Flink job in a job cluster, problem starting 
JobManager

Hi,

I am trying to run a flink job cluster in K8s. As a first step I have 
created a Docker image according to:


https://github.com/apache/flink/blob/release-1.7/flink-container/docker/README.md

When I try to run the image:

docker run --name=flink-job-manager flink-image:latest job-cluster 
--job-classname com.foo.bar.FlinkTest 
-Djobmanager.rpc.address=flink-job-cluster -Dparallelism.default=1 
-Dblob.server.port=6124 -Dqueryable-state.server.ports=6125


the execution fails with the following exception:

org.springframework.beans.factory.BeanCreationException: Error 
creating bean with name 'MyFlinkJob': Invocation of init method 
failed; nested exception is 
org.apache.flink.client.program.OptimizerPlanEnvironment$ProgramAbortException
at 
org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeInitialization(InitDestroyAnnotationBeanPostProcessor.java:139)
at 
org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyBeanPostProcessorsBeforeInitialization(AbstractAutowireCapableBeanFactory.java:419)
at 
org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1737)
at 
org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:576)
at 
org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:498)
at 
org.springframework.beans.factory.support.AbstractBeanFactory.lambda$doGetBean$0(AbstractBeanFactory.java:320)
at 
org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:222)
at 
org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:318)
at 
org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:199)
at 
org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:846)
at 
org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:863)
at 
org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:546)
at 
org.springframework.boot.SpringApplication.refresh(SpringApplication.java:775)
at 
org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:397)
at 
org.springframework.boot.SpringApplication.run(SpringApplication.java:316)
at 
org.springframework.boot.SpringApplication.run(SpringApplication.java:1260)
at 
org.springframework.boot.SpringApplication.run(SpringApplication.java:1248)

at com.foo.bar.FlinkTest.main(FlinkTest.java:10)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.springframework.boot.devtools.restart.RestartLauncher.run(RestartLauncher.java:49)
Caused by: 
org.apache.flink.client.program.OptimizerPlanEnvironment$ProgramAbortException
at 
org.apache.flink.streaming.api.environment.StreamPlanEnvironment.execute(StreamPlanEnvironment.java:70)

at com.foo.bar.FlinkJob.MyFlinkJob.init(MyFlinkJob.java:59)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleElement.invoke(InitDestroyAnnotationBeanPostProcessor.java:363)
at 
org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleMetadata.invokeInitMethods(InitDestroyAnnotationBeanPostProcessor.java:307)
at 

Re: Flink Standalone cluster - dumps

2019-02-10 Thread Chesnay Schepler
1) Setting the slot size to 1 can be used as a work-around. I'm not 
aware of another solution for standalone clusters. In the 
YARN/Kubernetes world we support the notion of a "job cluster", which is 
started and run only for a single job, but this isn't supported in 
standalone mode.

2) None that I'm aware of unfortunately.

On 08.02.2019 15:06, simpleusr wrote:

Flink Standalone cluster - dumps

We are using a standalone cluster and submitting jobs through the command line
client.

As far as I understand, the job is executed in a task manager. A single task
manager represents a single JVM? So the dump shows threads from all jobs
bound to that task manager.
Two questions:

1) Is there a special setting so that each task manager is occupied by a
single job. Can setting slot size to 1 be a workaround?
2) If option above is not possible is there any way to get dump per job?


Regards



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/





Re: Flink Standalone cluster - logging problem

2019-02-10 Thread Chesnay Schepler

What exactly are you expecting to happen?

On 08.02.2019 15:06, simpleusr wrote:

We are using a standalone cluster and submitting jobs through the command line
client.

As stated in
https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/logging.html
, we are editing log4j-cli.properties but this does not have any effect?

Anybody seen that before?

Regards




--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/





Re: Reduce one event under multiple keys

2019-02-10 Thread Chesnay Schepler

This sounds reasonable to me.

I'm a bit confused by this question: "Additionally, I am (naïvely)
hoping that if a window has no events for a particular key, the
memory/storage costs are zero for that key."


Are you asking whether a key that was received in window X (as part of 
an event) is still present in window x+1? If so, then the answer is no; 
a key will only be present in a given window if an event was received 
that fits into that window.
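
A minimal sketch of the flatMap + keyBy fan-out described in the quoted mail
below, assuming events arrive as Tuple3<source, action, path> tuples (adapt to
your real event type):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.util.Collector;

// Emits each event once per (source, path-prefix) key, so one open/close event is
// aggregated for the PC as a whole, for every ancestor folder, and for the file itself.
public class FanOutByPathPrefix implements
        FlatMapFunction<Tuple3<String, String, String>,
                        Tuple2<Tuple2<String, String>, Tuple3<String, String, String>>> {

    @Override
    public void flatMap(Tuple3<String, String, String> event,
                        Collector<Tuple2<Tuple2<String, String>, Tuple3<String, String, String>>> out) {
        String source = event.f0;  // e.g. "ca:fe:ba:be"
        String path = event.f2;    // e.g. "/foo/bar/README.txt"

        out.collect(Tuple2.of(Tuple2.of(source, "/"), event));
        StringBuilder prefix = new StringBuilder();
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                prefix.append('/').append(segment);
                out.collect(Tuple2.of(Tuple2.of(source, prefix.toString()), event));
            }
        }
    }
}

// usage (the Tuple2 in field 0 is the key):
// events.flatMap(new FanOutByPathPrefix()).keyBy(0).timeWindow(Time.minutes(X)).aggregate(...)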


On 08.02.2019 13:21, Stephen Connolly wrote:
Ok, I'll try and map my problem into something that should be familiar 
to most people.


Consider collection of PCs, each of which has a unique ID, e.g. 
ca:fe:ba:be, de:ad:be:ef, etc.


Each PC has a tree of local files. Some of the file paths are 
coincidentally the same names, but there is no file sharing between PCs.


I need to produce metrics about how often files are opened and how 
long they are open for.


I need for every X minute tumbling window not just the cumulative
averages for each PC, but the averages for each file as well as the
cumulative averages for each folder and their sub-folders.


I have a stream of events like

{"source":"ca:fe:ba:be","action":"open","path":"/foo/bar/README.txt"}{"source":"ca:fe:ba:be","action":"close","path":"/foo/bar/README.txt","duration":"67"}{"source":"de:ad:be:ef","action":"open","path":"/foo/manchu/README.txt"}
{"source":"ca:fe:ba:be","action":"open","path":"/foo/bar/User 
guide.txt"}{"source":"ca:fe:ba:be","action":"open","path":"/foo/bar/Admin 
guide.txt"}{"source":"ca:fe:ba:be","action":"close","path":"/foo/bar/User 
guide.txt","duration":"97"}{"source":"ca:fe:ba:be","action":"close","path":"/foo/bar/Admin 
guide.txt","duration":"196"}

{"source":"ca:fe:ba:be","action":"open","path":"/foo/manchu/README.txt"}
{"source":"de:ad:be:ef","action":"open","path":"/bar/foo/README.txt"}

So from that I would like to know stuff like:

ca:fe:ba:be had 4/X opens per minute in the X minute window
ca:fe:ba:be had 3/X closes per minute in the X minute window and the 
average time open was (67+97+196)/3=120... there is no guarantee that
the closes will be matched with opens in the same window, which is why 
I'm only tracking them separately

de:ad:be:ef had 2/X opens per minute in the X minute window
ca:fe:ba:be /foo had 4/X opens per minute in the X minute window
ca:fe:ba:be /foo had 3/X closes per minute in the X minute window and 
the average time open was 120

de:ad:be:ef /foo had 1/X opens per minute in the X minute window
de:ad:be:ef /bar had 1/X opens per minute in the X minute window
de:ad:be:ef /foo/manchu had 1/X opens per minute in the X minute window
de:ad:be:ef /bar/foo had 1/X opens per minute in the X minute window
de:ad:be:ef /foo/manchu/README.txt had 1/X opens per minute in the X 
minute window
de:ad:be:ef /bar/foo/README.txt had 1/X opens per minute in the X 
minute window

etc

What I think I want to do is turn each event into a series of events 
with different keys, so that


{"source":"ca:fe:ba:be","action":"open","path":"/foo/bar/README.txt"}

gets sent under the keys:

("ca:fe:ba:be","/")
("ca:fe:ba:be","/foo")
("ca:fe:ba:be","/foo/bar")
("ca:fe:ba:be","/foo/bar/README.txt")

Then I could use a window aggregation function to just:

* count the "open" events
* count the "close" events and sum their duration

Additionally, I am (naïvely) hoping that if a window has no events
for a particular key, the memory/storage costs are zero for that key.


From what I can see, to achieve what I am trying to do, I could use a 
flatMap followed by a keyBy


In other words I take the events and flat map them based on the path 
split on '/' returning a Tuple of the (to be) key and the event. Then 
I can use keyBy to key based on the Tuple 0.


My ask:

Is the above design a good design? How would you achieve the end game 
better? Do I need to worry about many paths that are accessed rarely 
and would have an accumulator function that stays at 0 unless there 
are events in that window... or are the accumulators for each distinct 
key eagerly purged after each fire trigger.


What gotcha's do I need to look for.

Thanks in advance and appologies for the length

-stephenc





Re: Couldn't figure out - How to do this in Flink? - Pls assist with suggestions

2019-02-10 Thread Chesnay Schepler
You should be able to use a KeyedProcessFunction for that.

Find matching elements via keyBy() on the first field.
Aggregate into ValueState, send alert if necessary.
Upon encountering a new key, setup a timer to remove the entry in 24h.
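
With two input streams, one way to realize this is a CoProcessFunction on the
connected, keyed streams. A minimal sketch, assuming the tuples below are
Tuple3<Long, Long, Double> and Tuple2<Long, Double> and using the >150 threshold
from the question (all types are assumed from the example values):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

// Applied on splitZTuple.connect(splittedVomsTuple), keyed on the first field of both streams.
public class AccumulateAndAlert extends
        CoProcessFunction<Tuple3<Long, Long, Double>, Tuple2<Long, Double>, Tuple3<Long, Long, Double>> {

    private static final long TTL_MS = 24 * 60 * 60 * 1000L;  // 24 h expiry
    private static final double THRESHOLD = 150.0;            // alert threshold from the question

    private transient ValueState<Tuple3<Long, Long, Double>> entry;

    @Override
    public void open(Configuration parameters) {
        entry = getRuntimeContext().getState(new ValueStateDescriptor<>(
                "entry", TypeInformation.of(new TypeHint<Tuple3<Long, Long, Double>>() {})));
    }

    @Override
    public void processElement1(Tuple3<Long, Long, Double> value, Context ctx,
                                Collector<Tuple3<Long, Long, Double>> out) throws Exception {
        if (entry.value() == null) {
            // first time this key is seen: schedule cleanup in 24 h
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + TTL_MS);
        }
        entry.update(value);
    }

    @Override
    public void processElement2(Tuple2<Long, Double> value, Context ctx,
                                Collector<Tuple3<Long, Long, Double>> out) throws Exception {
        Tuple3<Long, Long, Double> current = entry.value();
        if (current != null) {
            // accumulate the second stream's amount into the stored third element
            Tuple3<Long, Long, Double> updated = Tuple3.of(current.f0, current.f1, current.f2 + value.f1);
            entry.update(updated);
            if (updated.f2 > THRESHOLD) {
                out.collect(updated); // a downstream sink can call the REST alert endpoint
            }
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple3<Long, Long, Double>> out) {
        entry.clear(); // 24 h expiry
    }
}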

On 08.02.2019 07:43, Titus Rakkesh wrote:


Dears,

I have a data stream continuously coming,

DataStream> splitZTuple;

Eg  - (775168263,113182,0.0)

I have to store this for 24 hrs expiry in somewhere (Window or 
somewhere) to check against another stream.


The second stream is

DataStream> splittedVomsTuple which also 
continuously receiving one.


Eg. (775168263,100.0)


We need to accumulate the third element in (775168263,113182,0.0)
in the WINDOW (if the corresponding first element match happened with
the incoming second stream's element (775168263,100.0)).


While keeping this WINDOW session, if the third element of any
(775168263,113182,175) in the window stream exceeds a value (e.g. >150), we need
to call another REST endpoint to send an alert ---
(775168263,113182,175) matches the criteria. Simply a CEP-style callback.



In Flink, how can we do this kind of operation? Or do I need to think
about any other framework? Please advise.


Thanks...