Re: Checking actual config values used by TaskManager

2016-04-28 Thread Timur Fayruzov
If you're talking about parameters that were set on JVM startup then `ps
aux|grep flink` on an EMR slave node should do the trick, that'll give you
the full command line.
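A couple of places to look, sketched below (log locations vary by Flink version and EMR layout, so treat the paths as examples): the running TaskManager JVM shows its startup flags in the process list, and the configuration values the TaskManager actually loaded are logged when it starts up.

    # JVM-level settings (heap, GC flags) of the running TaskManager
    ps aux | grep flink

    # Configuration keys the TaskManager reported loading at startup; the container
    # log path below is only an example of where YARN keeps it on an EMR slave
    grep -i "configuration" /var/log/hadoop-yarn/containers/*/*/taskmanager.log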

On Thu, Apr 28, 2016 at 9:00 PM, Ken Krugler wrote:

> Hi all,
>
> I’m running jobs on EMR via YARN, and wondering how to check exactly what
> configuration settings are actually being used.
>
> This is mostly for the TaskManager.
>
> I know I can modify the conf/flink-conf.yaml file, and (via the CLI) I can
> use -yD param=value.
>
> But my experience with Hadoop makes me want to see the exact values being
> used, versus assuming I know what’s been set :)
>
> Thanks,
>
> — Ken
>
>
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
>


EMR vCores and slot allocation

2016-04-28 Thread Ken Krugler
Based on what Flink reports in the JobManager GUI, it looks like it thinks that 
the EC2 instances I’m using for my EMR jobs only have 4 physical cores.

Which would make sense, as Amazon describes these servers as having 8 vCores.

From https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html, the recommended configuration would then be 4 slots/TaskManager, yes?
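For reference, a sketch of what that setting would look like (slot count assumed to be one per physical core, per the docs page above):

    # flink-conf.yaml
    taskmanager.numberOfTaskSlots: 4

    # or per job on YARN via the CLI (flag name as of Flink 1.x; verify for your version):
    #   flink run -m yarn-cluster -ys 4 ...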

Thanks,

— Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





Anyone going to ApacheCon Big Data in Vancouver?

2016-04-28 Thread Ken Krugler
Hi all,

Is anyone else from the community going?

It would be fun to meet up with other Flink users during the event.

I’ll be there from Sunday (May 8th) to early Wednesday afternoon (May 11th).

— Ken

PS - On Monday I’ll be giving a talk on my experience with using the Flink planner for Cascading.



Re: Multiple windows with large number of partitions

2016-04-28 Thread Christopher Santiago
Hi Aljoscha,


Aljoscha Krettek wrote
>>is there a reason for keying on both the "date only" field and the
"userid"? I think you should be fine by just specifying that you want 1-day
windows on your timestamps.

My mistake, this was from earlier tests that I had performed.  I removed it
and went to keyBy(2) and I am still experiencing the same issues.


Aljoscha Krettek wrote
>>Also, do you have a timestamp extractor in place that takes the timestamp
from your data and sets it as the internal timestamp field. 

Yes there is, it is from the BoundedOutOfOrdernessGenerator example:

// Note: the generic type parameters were stripped by the list archive. Assuming a
// Tuple3 whose first field is a Joda DateTime (see element.f0.getMillis() below);
// the remaining two fields are placeholders.
public static class BoundedOutOfOrdernessGenerator implements
        AssignerWithPeriodicWatermarks<Tuple3<DateTime, String, Long>> {

    private static final long serialVersionUID = 1L;
    private final long maxOutOfOrderness = Time.days(2).toMilliseconds();
    private long currentMaxTimestamp;

    @Override
    public long extractTimestamp(Tuple3<DateTime, String, Long> element,
            long previousElementTimestamp) {
        long timestamp = element.f0.getMillis();
        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
        return timestamp;
    }

    @Override
    public Watermark getCurrentWatermark() {
        return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
    }
}
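For completeness, a sketch of how the assigner above is typically wired in before the keyBy/window; the stream variable and its element type are placeholders matching the reconstruction above:

    DataStream<Tuple3<DateTime, String, Long>> withTimestamps =
        stream.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessGenerator());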

Thanks,
Chris





Re: Flink Client use remote app jar

2016-04-28 Thread Theofilos Kakantousis
Thanks for the update. Hopefully when I get the time, I will try to 
contribute.


Cheers,
Theo

On 2016-04-27 11:24, Till Rohrmann wrote:
At the moment, there is no concrete plan to introduce such a feature, 
because it cannot be guaranteed that you always have a distributed 
file system available. But we could maybe add it as a tool which we 
contribute to flink-contrib. Do you wanna take the lead?


Cheers,
Till

On Wed, Apr 27, 2016 at 10:12 AM, Theofilos Kakantousis wrote:


Hi Till,

Thank you for the quick reply. Do you think that would be a useful
feature in the future, for the Client to automatically download a
job jar from HDFS, or are there no plans to introduce it?

Cheers,
Theofilos


On 2016-04-27 10:42, Till Rohrmann wrote:


Hi Theofilos,

I'm afraid that is currently not possible with Flink. Flink
expects the user code jar to be uploaded to its Blob server.
That's what the client does prior to submitting the job. You
would have to upload the jar with the BlobClient manually if you
wanted to circumvent the Client.

Cheers,
Till
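A minimal sketch of the workaround that follows from this, assuming the jar can be pulled from HDFS to a local temp file before handing it to PackagedProgram; the class name and paths are placeholders, not an official Flink facility:

    import java.io.File;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsJarFetcher {
        // Copies e.g. hdfs:///jobs/my-job.jar to a local temp file so the
        // regular Client/PackagedProgram path can be used afterwards.
        public static File fetchJar(String hdfsJarUri) throws Exception {
            FileSystem fs = FileSystem.get(new java.net.URI(hdfsJarUri), new Configuration());
            File localJar = File.createTempFile("flink-job", ".jar");
            fs.copyToLocalFile(new Path(hdfsJarUri), new Path(localJar.getAbsolutePath()));
            return localJar;  // pass this File to PackagedProgram as usual
        }
    }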

On Apr 26, 2016 11:54 PM, "Theofilos Kakantousis" wrote:

Hi everyone,

Flink 0.10.1
Hadoop 2.4.0

Fairly new to Flink here, so my question might be simple, but I
couldn't find anything relevant in the docs. I am
implementing a Flink client that submits jobs to a Flink Yarn
cluster. At the moment I am using the Client and
PackagedProgram classes which are working fine, however the
latter expects the job jar to be available locally so that
the Client can submit it to the Flink Yarn cluster.

Is it possible to use a job jar that is already stored in the
HDFS of the cluster where Flink is running on, without first
copying it locally to where the Client is?

Thank you,
Theofilos








Problem in creating quickstart project using archetype (Scala)

2016-04-28 Thread nsengupta
Hello all,

I don't know if anyone else has faced this; I haven't so far.

When I try to create a new project template following the instructions here, it fails.

This is what happens (along with the command I give):

nirmalya@Cheetah:~/Workspace-Flink$ mvn archetype:generate \
>   -DarchetypeGroupId=org.apache.flink  \
>   -DarchetypeArtifactId=flink-quickstart-scala \
>   -DarchetypeVersion=1.1-SNAPSHOT
[INFO] Scanning for projects...
[INFO] 
[INFO]

[INFO] Building Maven Stub Project (No POM) 1
[INFO]

[INFO] 
[INFO] >>> maven-archetype-plugin:2.3:generate (default-cli) >
generate-sources @ standalone-pom >>>
[INFO] 
[INFO] <<< maven-archetype-plugin:2.3:generate (default-cli) <
generate-sources @ standalone-pom <<<
[INFO] 
[INFO] --- maven-archetype-plugin:2.3:generate (default-cli) @
standalone-pom ---
[INFO] Generating project in Interactive mode
[INFO] Archetype repository not defined. Using the one from
[org.apache.flink:flink-quickstart-scala:1.0.2] found in catalog remote
[INFO]

[INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 01:15 min
[INFO] Finished at: 2016-04-28T22:22:57+05:30
[INFO] Final Memory: 14M/226M
[INFO]

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-archetype-plugin:2.3:generate (default-cli)
on project standalone-pom: The desired archetype does not exist
(org.apache.flink:flink-quickstart-scala:1.1-SNAPSHOT) -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please
read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
nirmalya@Cheetah:~/Workspace-Flink$ 


Could someone please point out the mistake? 

-- Nirmalya
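One hedged guess at a workaround, not taken from the thread: the remote catalog in the log above only lists the released archetype (1.0.2), so pinning that version should get past the error; using 1.1-SNAPSHOT requires the Apache snapshot repository to be visible to Maven.

    mvn archetype:generate \
      -DarchetypeGroupId=org.apache.flink \
      -DarchetypeArtifactId=flink-quickstart-scala \
      -DarchetypeVersion=1.0.2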





Re: aggregation problem

2016-04-28 Thread Vasiliki Kalavri
Hi Riccardo,

can you please be a bit more specific? What do you mean by "it didn't
work"? Did it crash? Did it give you a wrong value? Something else?

-Vasia.

On 28 April 2016 at 16:52, Riccardo Diomedi 
wrote:

> Hi everybody
>
> In a DeltaIteration I have a DataSet> where, at a
> certain point of the iteration, i need to count the total number of tuples
> and the total number of elements in the HashSet of each tuple, and then
> send both value to the ConvergenceCriterion function.
>
> Example:
>
> this is the content of my DataSet:
> (*1*,2,*[2,3]*)
> (*2*,1,*[3,4]*)
> (*3*,2,[*4,5]*)
>
> i should have:
> first count: *3* (1,2,3)
> second count: *4* (2,3,4,5)
>
> i tried to iterate the dataset through a flatMap and exploit so an
> aggregator, putting an HashSet into it(Aggregator), but it didn’t work!
>
> Do you have any suggestion??
>
> thanks
>
> Riccardo
>


Re: Getting java.lang.Exception when try to fetch data from Kafka

2016-04-28 Thread Robert Metzger
I would refer to the SimpleStringSchema as an example.
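A rough sketch of such a schema, not taken from the thread: it treats the Kafka key as a plain String (via new String(byte[])) and passes the value through as raw bytes. Interface and package names follow the Flink 1.x Kafka connector and should be checked against the version in use.

    import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
    import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.api.java.typeutils.TupleTypeInfo;
    import org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema;

    public class StringKeyBytesValueSchema
            implements KeyedDeserializationSchema<Tuple2<String, byte[]>> {

        @Override
        public Tuple2<String, byte[]> deserialize(byte[] messageKey, byte[] message,
                String topic, int partition, long offset) {
            // The key was written with Kafka's StringSerializer, so new String(...) suffices.
            String key = messageKey == null ? null : new String(messageKey);
            return new Tuple2<>(key, message);
        }

        @Override
        public boolean isEndOfStream(Tuple2<String, byte[]> nextElement) {
            return false;
        }

        @Override
        public TypeInformation<Tuple2<String, byte[]>> getProducedType() {
            return new TupleTypeInfo<>(BasicTypeInfo.STRING_TYPE_INFO,
                    PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO);
        }
    }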

On Wed, Apr 27, 2016 at 7:11 PM, prateekarora 
wrote:

> Thanks for the response .
>
> can you please suggest some link or example to write own
> DeserializationSchema ?
>
> Regards
> Prateek
>
> On Tue, Apr 26, 2016 at 11:06 AM, rmetzger0 [via Apache Flink User Mailing
> List archive.] <[hidden email]
> > wrote:
>
>> Hi Prateek,
>>
>> sorry for the late response. Can you try implementing your own
>> DeserializationSchema, where you deserialize the String key manually (just
>> call the "new String(byte[]) constructor).
>>
>> The TypeInformationKeyValueSerializationSchema[String, byte] is
>> generating deserializers with Flink's internal serializer stack (these
>> assume that the data has been serialized by Flink as well). I think Flink's
>> StringSerializer does some fancy optimizations and is not compatible with
>> the standard String() format.
>>
>>
>>
>> On Tue, Apr 26, 2016 at 6:34 PM, prateek arora <[hidden email]
>> > wrote:
>>
>>> Hi Robert ,
>>>
>>> Hi
>>>
>>> I have java program to send data into kafka topic. below is code for
>>> this :
>>>
>>> // generic parameters below were stripped by the archive; assuming <String, byte[]>
>>> private Producer<String, byte[]> producer = null;
>>>
>>> Serializer<String> keySerializer = new StringSerializer();
>>> Serializer<byte[]> valueSerializer = new ByteArraySerializer();
>>> producer = new KafkaProducer<>(props, keySerializer,
>>> valueSerializer);
>>>
>>> ProducerRecord<String, byte[]> imageRecord;
>>> imageRecord = new ProducerRecord<>(streamInfo.topic,
>>> Integer.toString(messageKey), imageData);
>>>
>>> producer.send(imageRecord);
>>>
>>>
>>> then trying to fetch data in Apache flink .
>>>
>>> Regards
>>> Prateek
>>>
>>> On Mon, Apr 25, 2016 at 2:42 AM, Robert Metzger <[hidden email]
>>> > wrote:
>>>
 Hi Prateek,

 were the messages written to the Kafka topic by Flink, using the
 TypeInformationKeyValueSerializationSchema ? If not, maybe the Flink
 deserializers expect a different data format of the messages in the topic.

 How are the messages written into the topic?


 On Fri, Apr 22, 2016 at 10:21 PM, prateekarora <[hidden email]
 > wrote:

>
> Hi
>
> I am sending data using kafkaProducer API
>
> imageRecord = new ProducerRecord<String, byte[]>(topic, messageKey, imageData);
> producer.send(imageRecord);
>
>
> And in the Flink program I try to fetch data using FlinkKafkaConsumer08.
> Below is the sample code.
>
> def main(args: Array[String]) {
>   val env = StreamExecutionEnvironment.getExecutionEnvironment
>   val properties = new Properties()
>   properties.setProperty("bootstrap.servers",
> ":9092")
>   properties.setProperty("zookeeper.connect",
> ":2181")
>   properties.setProperty("group.id", "test")
>
>   val readSchema = new
>
> TypeInformationKeyValueSerializationSchema[String,Array[Byte]](classOf[String],classOf[Array[Byte]],
>
> env.getConfig).asInstanceOf[KeyedDeserializationSchema[(String,Array[Byte])]]
>
>   val stream : DataStream[(String,Array[Byte])]  =
> env.addSource(new
> FlinkKafkaConsumer08[(String,Array[Byte])]("a-0",readSchema,
> properties))
>
>   stream.print
>   env.execute("Flink Kafka Example")
>   }
>
>
> but getting  below error :
>
> 16/04/22 13:43:39 INFO ExecutionGraph: Source: Custom Source -> Sink:
> Unnamed (1/4) (d7a151560f6eabdc587a23dc0975cb84) switched from RUNNING
> to
> FAILED
> 16/04/22 13:43:39 INFO ExecutionGraph: Source: Custom Source -> Sink:
> Unnamed (2/4) (d43754a27e402ed5b02a73d1c9aa3125) switched from RUNNING
> to
> CANCELING
>
> java.lang.Exception
> at
>
> org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher.run(LegacyFetcher.java:222)
> at
>
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08.run(FlinkKafkaConsumer08.java:316)
> at
>
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:78)
> at
>
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:56)
> at
>
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:225)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.EOFException
> at
>
> org.apache.flink.runtime.util.DataInputDeserializer.readUnsignedByte(DataInputDeserializer.java:298)
> at
> 

Re: Reducing parallelism leads to NoResourceAvailableException

2016-04-28 Thread Ken Krugler
Hi Ufuk,

> On Apr 28, 2016, at 1:32am, Ufuk Celebi  wrote:
> 
> Hey Ken!
> 
> That should not happen. Can you check the web interface for two things:
> 
> - How many available slots are advertized on the landing page
> (localhost:8081) when you submit your job?

I’m running this on YARN, so I don’t believe the web UI shows up until the 
Flink AppManager has been started, which means I don’t know the advertised 
number of available slots before the job is running.

> - Can you check the actual parallelism of the submitted job (it should
> appear as a FAILED job in the web frontend). Is it really 15?

Same as above, the Flink web UI is gone once the job has failed.

Any suggestions for how to check the actual parallelism in this type of 
transient YARN environment?

Thanks,

— Ken


> On Thu, Apr 28, 2016 at 12:52 AM, Ken Krugler
>  wrote:
>> Hi all,
>> 
>> In trying out different settings for performance, I run into a job failure
>> case that puzzles me.
>> 
>> I’d done a run with a parallelism of 20 (-p 20 via CLI), and the job ran
>> successfully, on a cluster with 40 slots.
>> 
>> I then tried with -p 15, and it failed with:
>> 
>> NoResourceAvailableException: Not enough free slots available to run the
>> job. You can decrease the operator parallelism…
>> 
>> But the change was to reduce parallelism - why would that now cause this
>> problem?
>> 
>> Thanks,
>> 
>> — Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





Re: Reducing parallelism leads to NoResourceAvailableException

2016-04-28 Thread Ken Krugler

> On Apr 28, 2016, at 1:32am, Aljoscha Krettek  wrote:
> 
> Hi,
> is this a streaming or batch job?

Batch.

> If it is a batch job, are you using either collect() or print() on a DataSet?

Definitely not a print(). Don’t know about collect(), since the job is created 
via the Cascading-Flink planner. Fabian would know best.

— Ken

> 
> Cheers,
> Aljoscha
> 
> On Thu, 28 Apr 2016 at 00:52 Ken Krugler  > wrote:
> Hi all,
> 
> In trying out different settings for performance, I run into a job failure 
> case that puzzles me.
> 
> I’d done a run with a parallelism of 20 (-p 20 via CLI), and the job ran 
> successfully, on a cluster with 40 slots.
> 
> I then tried with -p 15, and it failed with:
> 
> NoResourceAvailableException: Not enough free slots available to run the job. 
> You can decrease the operator parallelism…
> 
> But the change was to reduce parallelism - why would that now cause this 
> problem?
> 
> Thanks,
> 
> — Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





aggregation problem

2016-04-28 Thread Riccardo Diomedi
Hi everybody

In a DeltaIteration I have a DataSet of Tuple3s whose third field is a HashSet (the
generic parameters were stripped by the archive) where, at a
certain point of the iteration, I need to count the total number of tuples and
the total number of elements in the HashSets of the tuples, and then send both
values to the ConvergenceCriterion function.

Example:

this is the content of my DataSet:
(1,2,[2,3])
(2,1,[3,4])
(3,2,[4,5])

i should have:
first count: 3 (1,2,3)
second count: 4 (2,3,4,5)

I tried to iterate over the dataset with a flatMap and make use of an aggregator,
putting a HashSet into it (the Aggregator), but it didn’t work!

Do you have any suggestions?

thanks 

Riccardo

Re: Flink log dir

2016-04-28 Thread Chesnay Schepler
According to https://issues.apache.org/jira/browse/FLINK-3678 it should
be available in 1.0.3.
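For reference, a one-line sketch of the setting once on 1.0.3 (the path is just a placeholder):

    # flink-conf.yaml
    env.log.dir: /var/log/flink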


On 28.04.2016 16:23, Flavio Pompermaier wrote:

Hi to all,
I'm using Flink 1.0.1 and I can't find how to change the log directory. In
the current master I see that there's the
*env.log.dir* parameter to configure. From which version is it / will it be
available?


Best,
Flavio




Re: Regarding Broadcast of datasets in streaming context

2016-04-28 Thread Gyula Fóra
Hi Biplob,

I have implemented a similar algorithm as Aljoscha mentioned.

First things to clarify are the following:
There is currently no abstraction for keeping objects (in you case
centroids) in a centralized way that can be updated/read by all operators.
This would probably be very costly and is actually not necessary in your
case.

Broadcasting a stream, in contrast with other partitioning methods, means that
the events will be replicated to all downstream operators. It is not a
magical operator that will make state available among parallel instances.

Now let me explain what I think you want from Flink and how to do it :)

You have an input data stream and a set of centroids to be updated based on
the incoming records. As you want to do this in parallel, you have an
operator (let's say a flatmap) that keeps the centroids locally and updates
them on its inputs. Now you have a set of independently updated centroids,
so you want to merge them and update the centroids in each flatmap.

Let's see how to do this. Given that you have your centroids locally,
updating them is super easy, so I will not talk about that. The problematic
part is periodically merging and "broadcasting" the centroids so all the
flatmaps eventually see the same ones (they don't have to always be the same for
clustering, probably). There is no operator for sending state (centroids)
between subtasks, so you have to be clever here. We can actually use cyclic
streams to solve this problem by sending the centroids as simple events to
a CoFlatMap:

// Note: generic parameters were lost in the archive; "Point" is a placeholder element type.
DataStream<Point> input = ...
ConnectedIterativeStreams<Point, Centroids> inputsAndCentroids =
    input.iterate().withFeedbackType(Centroids.class);
DataStream<Centroids> updatedCentroids =
    inputsAndCentroids.flatMap(new MyCoFlatmap());
inputsAndCentroids.closeWith(updatedCentroids.broadcast());

MyCoFlatmap would be a CoFlatMapFunction which on one input receives events
and updates its local centroids (and periodically outputs the centroids), and
on the other input receives the centroids of other flatmaps and merges
them into the local ones.
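A rough outline of what MyCoFlatmap could look like, following the description above; Point and Centroids are placeholder types, and their update/merge methods are assumed here, not part of any Flink API:

    public class MyCoFlatmap implements CoFlatMapFunction<Point, Centroids, Centroids> {

        private final Centroids localCentroids = new Centroids();
        private long seen = 0;

        @Override
        public void flatMap1(Point value, Collector<Centroids> out) {
            // data input: update the local centroids and periodically share them
            localCentroids.update(value);
            if (++seen % 1000 == 0) {
                out.collect(localCentroids);
            }
        }

        @Override
        public void flatMap2(Centroids other, Collector<Centroids> out) {
            // feedback input: centroids broadcast by other subtasks are merged locally
            localCentroids.merge(other);
        }
    }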

This might be a lot to take in at first, so you might want to read up on
streaming iterations and connected streams before you start.

Let me know if this makes sense.

Cheers,
Gyula


Biplob Biswas wrote (on Thu, Apr 28, 2016, at 14:41):

> That would really be great, any example would help me proceed with my work.
> Thanks a lot.
>
>
> Aljoscha Krettek wrote
> > Hi Biplob,
> > one of our developers had a stream clustering example a while back. It
> was
> > using a broadcast feedback edge with a co-operator to update the
> > centroids.
> > I'll directly include him in the email so that he will notice and can
> send
> > you the example.
> >
> > Cheers,
> > Aljoscha
> >
> > On Thu, 28 Apr 2016 at 13:57 Biplob Biswas 
>
> > revolutionisme@
>
> >  wrote:
> >
> >> I am pretty new to flink systems, thus can anyone atleast give me an
> >> example
> >> of how datastream.broadcast() method works? From the documentation i get
> >> the
> >> following:
> >>
> >> broadcast()
> >> Sets the partitioning of the DataStream so that the output elements are
> >> broadcasted to every parallel instance of the next operation.
> >>
> >> If the output elements are broadcasted, then how are they retrieved? Or
> >> maybe I am looking at this method in a completely wrong way?
> >>
> >> Thanks
> >> Biplob Biswas
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Regarding-Broadcast-of-datasets-in-streaming-context-tp6456p6543.html
> >> Sent from the Apache Flink User Mailing List archive. mailing list
> >> archive
> >> at Nabble.com.
> >>
>
>
>
>
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Regarding-Broadcast-of-datasets-in-streaming-context-tp6456p6548.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>


Flink log dir

2016-04-28 Thread Flavio Pompermaier
Hi to all,

I'm using Flink 1.0.1 and I can't find how to change the log directory. In the
current master I see that there's the *env.log.dir* parameter to configure.
From which version is it / will it be available?

Best,
Flavio


Re: Requesting the next InputSplit failed

2016-04-28 Thread Fabian Hueske
Yes, assigning more than 0.5GB to a JM is a good idea. 3GB is maybe a bit
too much, 2GB should be enough.
Increasing the timeout should not hurt either.
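A sketch of the corresponding flink-conf.yaml changes (values are only illustrative):

    jobmanager.heap.mb: 2048   # raise the JobManager heap as discussed
    akka.ask.timeout: 100 s    # raise the RPC timeout from the 10 s default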

2016-04-28 14:14 GMT+02:00 Flavio Pompermaier :

> So what do you suggest to try for the next run?
> I was going to increase the Job Manager heap to 3 GB and maybe change some
> gc setting.
> Do you think I should increase also the akka timeout or other things?
>
> On Thu, Apr 28, 2016 at 2:06 PM, Fabian Hueske  wrote:
>
>> Hmm, 113k splits is quite a lot.
>> However, the IF uses the DefaultInputSplitAssigner which is very
>> lightweight and should handle a large number of splits well.
>>
>>
>>
>> 2016-04-28 13:50 GMT+02:00 Flavio Pompermaier :
>>
>>> We generate 113k splits because we can't query more than 100k or records
>>> per split (and we have to manage 11 billions of records). We tried to run
>>> the job only once, before running it the 2nd time we would like to
>>> understand which parameter to tune in order to (try to at least to) avoid
>>> such an error.
>>>
>>> Of course I pasted the wrong TM heap size...that is indeed 3Gb (
>>> taskmanager.heap.mb:512)
>>>
>>> Best,
>>> Flavio
>>>
>>> On Thu, Apr 28, 2016 at 1:29 PM, Fabian Hueske 
>>> wrote:
>>>
 Is the problem reproducible?
 Maybe the SplitAssigner gets stuck somehow, but I've never observed
 something like that.

 How many splits do you generate?

 I guess it is not related, but 512MB for a TM is not a lot on machines
 with 16GB RAM.

 2016-04-28 12:12 GMT+02:00 Flavio Pompermaier :

> When does this usually happens? Is it because the JobManager has too
> few resources (of some type)?
>
> Our current configuration of the cluster has 4 machines (with 4 CPUs
> and 16 GB of RAM) and one machine has both a JobManager and a TaskManger
> (the other 3 just a TM).
>
> Our flink-conf.yml on every machine has the following params:
>
>- jobmanager.heap.mb:512
>- taskmanager.heap.mb:512
>- taskmanager.numberOfTaskSlots:6
>- prallelism.default:24
>- env.java.home=/usr/lib/jvm/java-8-oracle/
>- taskmanager.network.numberOfBuffers:16384
>
> The job just read a window of max 100k elements and then writes a
> Tuple5 into a CSV on the jobmanger fs with parallelism 1 (in order to
> produce a single file). The job dies after 40 minutes and hundreds of
> millions of records read.
>
> Do you see anything sospicious?
>
> Thanks for the support,
> Flavio
>
> On Thu, Apr 28, 2016 at 11:54 AM, Fabian Hueske 
> wrote:
>
>> I checked the input format from your PR, but didn't see anything
>> suspicious.
>>
>> It is definitely OK if the processing of an input split tasks more
>> than 10 seconds. That should not be the cause.
>> It rather looks like the DataSourceTask fails to request a new split
>> from the JobManager.
>>
>> 2016-04-28 9:37 GMT+02:00 Stefano Bortoli :
>>
>>> Digging the logs, we found this:
>>>
>>> WARN  Remoting - Tried to associate with unreachable remote address
>>> [akka.tcp://flink@127.0.0.1:34984]. Address is now gated for 5000
>>> ms, all messages to this address will be delivered to dead letters. 
>>> Reason:
>>> Connessione rifiutata: /127.0.0.1:34984
>>>
>>> however, it is not clear why it should refuse a connection to itself
>>> after 40min of run. we'll try to figure out possible environment issues.
>>> Its a fresh installation, therefore we may have left out some
>>> configurations.
>>>
>>> saluti,
>>> Stefano
>>>
>>> 2016-04-28 9:22 GMT+02:00 Stefano Bortoli :
>>>
 I had this type of exception when trying to build and test Flink on
 a "small machine". I worked around the test increasing the timeout for 
 Akka.


 https://github.com/stefanobortoli/flink/blob/FLINK-1827/flink-tests/src/test/java/org/apache/flink/test/checkpointing/EventTimeAllWindowCheckpointingITCase.java

 it happened only on my machine (a VirtualBox I use for
 development), but not on Flavio's. Is it possible that on load 
 situations
 the JobManager slows down a bit too much?

 saluti,
 Stefano

 2016-04-27 17:50 GMT+02:00 Flavio Pompermaier :

> A precursor of the modified connector (since we started a long
> time ago). However the idea is the same, I compute the inputSplits 
> and then
> I get the data split by split (similarly to what it happens in 
> FLINK-3750 -
> https://github.com/apache/flink/pull/1941 )
>
> Best,
> Flavio

Re: General Data questions - streams vs batch

2016-04-28 Thread Fabian Hueske
True, flatMap does not have access to watermarks.

You can also go a bit more low-level and directly implement an
AbstractStreamOperator with the OneInputStreamOperator interface.
This is kind of the base class for the built-in stream operators, and it has
access to watermarks (OneInputStreamOperator.processWatermark()).

Maybe the easiest is to simply extend StreamFlatMap and override the
processWatermark() method.

Cheers, Fabian
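Not from the thread, but for the finite DataSet case in the example below there is also a simpler alternative to the caching flatMap: mapPartition sees the end of its partition and can flush the last, smaller batch naturally. The element type Tuple2<Long, String> and batch size 5 match that example; everything else is a sketch.

    source.mapPartition(new MapPartitionFunction<Tuple2<Long, String>, Tuple2<Long, String>>() {
        @Override
        public void mapPartition(Iterable<Tuple2<Long, String>> values,
                Collector<Tuple2<Long, String>> out) {
            List<Tuple2<Long, String>> cache = new ArrayList<>();
            for (Tuple2<Long, String> value : values) {
                cache.add(value);
                if (cache.size() == 5) {
                    cache.forEach(out::collect);   // emit a full batch
                    cache.clear();
                }
            }
            cache.forEach(out::collect);           // flush the final, possibly smaller batch
        }
    }).setParallelism(2).print();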

2016-04-28 14:40 GMT+02:00 Konstantin Kulagin :

> Thanks Fabian,
>
> works like a charm except the case when the stream is finite (or i have a
> dataset from the beginning).
>
> In this case I need somehow identify that stream is finished and emit
> latest batch (which might have less amount of elements) to output.
> What is the best way to do that? In streams and windows we have support
> for watermarks, but I do not see similar stuff for a flatMap operation?
>
> In the sample below I need to emit values from 30 to 32 as well:
>
>   ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>   DataSet<Tuple2<Long, String>> source =
> env.fromCollection(LongStream.range(0, 33).mapToObj(l ->
> Tuple2.of(l, "This is " + l)).collect(Collectors.toList()));
>
>   source.flatMap(new RichFlatMapFunction<Tuple2<Long, String>, Tuple2<Long, String>>() {
> List<Tuple2<Long, String>> cache = new ArrayList<>();
>
> @Override
> public RuntimeContext getRuntimeContext() {
>   return super.getRuntimeContext();
> }
>
> @Override
> public void flatMap(Tuple2<Long, String> value, Collector<Tuple2<Long, String>> out) throws Exception {
>   cache.add(value);
>   if (cache.size() == 5) {
> System.out.println("! " + Thread.currentThread().getId() + ":  " 
> + Joiner.on(",").join(cache));
> cache.stream().forEach(out::collect);
> cache.clear();
>   }
> }
>   }).setParallelism(2).print();
>
>   env.execute("yoyoyo");
> }
>
>
> Output (flink realted stuff excluded):
>
> ! 35:  (1,This is 1),(3,This is 3),(5,This is 5),(7,This is 7),(9,This
> is 9)
> ! 36:  (0,This is 0),(2,This is 2),(4,This is 4),(6,This is 6),(8,This
> is 8)
> ! 35:  (11,This is 11),(13,This is 13),(15,This is 15),(17,This is
> 17),(19,This is 19)
> ! 36:  (10,This is 10),(12,This is 12),(14,This is 14),(16,This is
> 16),(18,This is 18)
> ! 35:  (21,This is 21),(23,This is 23),(25,This is 25),(27,This is
> 27),(29,This is 29)
> ! 36:  (20,This is 20),(22,This is 22),(24,This is 24),(26,This is
> 26),(28,This is 28)
>
>
> And if you can give a bit more info on why will I have latency issues in a
> case of varying rate of arrival elements that would be perfect. Or point me
> to a direction where I can read it.
>
> Thanks!
> Konstantin.
>
> On Thu, Apr 28, 2016 at 7:26 AM, Fabian Hueske  wrote:
>
>> Hi Konstantin,
>>
>> if you do not need a deterministic grouping of elements you should not
>> use a keyed stream or window.
>> Instead you can do the lookups in a parallel flatMap function. The
>> function would collect arriving elements and perform a lookup query after a
>> certain number of elements arrived (can cause high latency if the arrival
>> rate of elements is low or varies).
>> The flatmap function can be executed in parallel and does not require a
>> keyed stream.
>>
>> Best, Fabian
>>
>>
>> 2016-04-25 18:58 GMT+02:00 Konstantin Kulagin :
>>
>>> As usual - thanks for answers, Aljoscha!
>>>
>>> I think I understood what I want to know.
>>>
>>> 1) To add few comments: about streams I was thinking about something
>>> similar to Storm where you can have one Source and 'duplicate' the same
>>> entry going through different 'path's.
>>> Something like this:
>>> https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_storm-user-guide/content/figures/1/figures/SpoutsAndBolts.png
>>> And later you can 'join' these separate streams back.
>>> And actually I think this is what I meant:
>>> https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/datastream/JoinedStreams.html
>>> - this one actually 'joins' by window.
>>>
>>> As for 'exact-once-guarantee' I've got the difference from this paper:
>>> http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink
>>> - Thanks!
>>>
>>> 2) understood, thank you very much
>>>
>>>
>>>
>>>
>>>
>>>
>>> I'll probably bother you one more time with another question:
>>>
>>> 3) Lets say I have a Source which provides raw (i.e. non-keyed) data.
>>> And lets say I need to 'enhance' each entry with some fields which I can
>>> take from a database.
>>> So I define some DbEnhanceOperation
>>>
>>> Database query might be expensive - so I would want to
>>> a) batch entries to perform queries
>>> b) be able to have several parallel DbEnhaceOperations so those will not
>>> slow down my whole processing.
>>>
>>>
>>> I do not see a way to do that?

Re: Regarding Broadcast of datasets in streaming context

2016-04-28 Thread Biplob Biswas
That would really be great, any example would help me proceed with my work.
Thanks a lot.


Aljoscha Krettek wrote
> Hi Biplob,
> one of our developers had a stream clustering example a while back. It was
> using a broadcast feedback edge with a co-operator to update the
> centroids.
> I'll directly include him in the email so that he will notice and can send
> you the example.
> 
> Cheers,
> Aljoscha
> 
> On Thu, 28 Apr 2016 at 13:57 Biplob Biswas 

> revolutionisme@

>  wrote:
> 
>> I am pretty new to flink systems, thus can anyone atleast give me an
>> example
>> of how datastream.broadcast() method works? From the documentation i get
>> the
>> following:
>>
>> broadcast()
>> Sets the partitioning of the DataStream so that the output elements are
>> broadcasted to every parallel instance of the next operation.
>>
>> If the output elements are broadcasted, then how are they retrieved? Or
>> maybe I am looking at this method in a completely wrong way?
>>
>> Thanks
>> Biplob Biswas
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Regarding-Broadcast-of-datasets-in-streaming-context-tp6456p6543.html
>> Sent from the Apache Flink User Mailing List archive. mailing list
>> archive
>> at Nabble.com.
>>







Re: General Data questions - streams vs batch

2016-04-28 Thread Konstantin Kulagin
Thanks Fabian,

works like a charm except the case when the stream is finite (or i have a
dataset from the beginning).

In this case I need somehow identify that stream is finished and emit
latest batch (which might have less amount of elements) to output.
What is the best way to do that? In streams and windows we have support for
watermarks, but I do not see similar stuff for a flatMap operation?

In the sample below I need to emit values from 30 to 32 as well:

  ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
  DataSet<Tuple2<Long, String>> source =
      env.fromCollection(LongStream.range(0, 33).mapToObj(l ->
          Tuple2.of(l, "This is " + l)).collect(Collectors.toList()));

  source.flatMap(new RichFlatMapFunction<Tuple2<Long, String>, Tuple2<Long, String>>() {
    List<Tuple2<Long, String>> cache = new ArrayList<>();

    @Override
    public RuntimeContext getRuntimeContext() {
      return super.getRuntimeContext();
    }

    @Override
    public void flatMap(Tuple2<Long, String> value,
        Collector<Tuple2<Long, String>> out) throws Exception {
      cache.add(value);
      if (cache.size() == 5) {
        System.out.println("! " + Thread.currentThread().getId() +
            ":  " + Joiner.on(",").join(cache));
        cache.stream().forEach(out::collect);
        cache.clear();
      }
    }
  }).setParallelism(2).print();

  env.execute("yoyoyo");
}


Output (flink realted stuff excluded):

! 35:  (1,This is 1),(3,This is 3),(5,This is 5),(7,This is 7),(9,This
is 9)
! 36:  (0,This is 0),(2,This is 2),(4,This is 4),(6,This is 6),(8,This
is 8)
! 35:  (11,This is 11),(13,This is 13),(15,This is 15),(17,This is
17),(19,This is 19)
! 36:  (10,This is 10),(12,This is 12),(14,This is 14),(16,This is
16),(18,This is 18)
! 35:  (21,This is 21),(23,This is 23),(25,This is 25),(27,This is
27),(29,This is 29)
! 36:  (20,This is 20),(22,This is 22),(24,This is 24),(26,This is
26),(28,This is 28)


And if you can give a bit more info on why will I have latency issues in a
case of varying rate of arrival elements that would be perfect. Or point me
to a direction where I can read it.

Thanks!
Konstantin.

On Thu, Apr 28, 2016 at 7:26 AM, Fabian Hueske  wrote:

> Hi Konstantin,
>
> if you do not need a deterministic grouping of elements you should not use
> a keyed stream or window.
> Instead you can do the lookups in a parallel flatMap function. The
> function would collect arriving elements and perform a lookup query after a
> certain number of elements arrived (can cause high latency if the arrival
> rate of elements is low or varies).
> The flatmap function can be executed in parallel and does not require a
> keyed stream.
>
> Best, Fabian
>
>
> 2016-04-25 18:58 GMT+02:00 Konstantin Kulagin :
>
>> As usual - thanks for answers, Aljoscha!
>>
>> I think I understood what I want to know.
>>
>> 1) To add few comments: about streams I was thinking about something
>> similar to Storm where you can have one Source and 'duplicate' the same
>> entry going through different 'path's.
>> Something like this:
>> https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_storm-user-guide/content/figures/1/figures/SpoutsAndBolts.png
>> And later you can 'join' these separate streams back.
>> And actually I think this is what I meant:
>> https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/datastream/JoinedStreams.html
>> - this one actually 'joins' by window.
>>
>> As for 'exact-once-guarantee' I've got the difference from this paper:
>> http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink
>> - Thanks!
>>
>> 2) understood, thank you very much
>>
>>
>>
>>
>>
>>
>> I'll probably bother you one more time with another question:
>>
>> 3) Lets say I have a Source which provides raw (i.e. non-keyed) data. And
>> lets say I need to 'enhance' each entry with some fields which I can take
>> from a database.
>> So I define some DbEnhanceOperation
>>
>> Database query might be expensive - so I would want to
>> a) batch entries to perform queries
>> b) be able to have several parallel DbEnhaceOperations so those will not
>> slow down my whole processing.
>>
>>
>> I do not see a way to do that?
>>
>>
>> Problems:
>>
>> I cannot go with countWindowAll because of b) - that thing does not
>> support several streams (correct?)
>>
>> So I need to create a windowed stream and for that I need to have some
>> key - Correct? I.e cannot create windows on a stream of general object just
>> using number of objects.
>>
>> I probably can 'emulate' keyed stream by providing some 'fake' key. But
>> in this case I can parallelize only on different keys. Again - it is
>> probably doable by introducing some AtomicLong key generator at the first
>> place ( this part probably hard to understand - I can share details if
>> necessary) but still looks like a bit of hack :)
>>
>> But the 

Re: Regarding Broadcast of datasets in streaming context

2016-04-28 Thread Aljoscha Krettek
Hi Biplob,
one of our developers had a stream clustering example a while back. It was
using a broadcast feedback edge with a co-operator to update the centroids.
I'll directly include him in the email so that he will notice and can send
you the example.

Cheers,
Aljoscha

On Thu, 28 Apr 2016 at 13:57 Biplob Biswas  wrote:

> I am pretty new to flink systems, thus can anyone atleast give me an
> example
> of how datastream.broadcast() method works? From the documentation i get
> the
> following:
>
> broadcast()
> Sets the partitioning of the DataStream so that the output elements are
> broadcasted to every parallel instance of the next operation.
>
> If the output elements are broadcasted, then how are they retrieved? Or
> maybe I am looking at this method in a completely wrong way?
>
> Thanks
> Biplob Biswas
>
>
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Regarding-Broadcast-of-datasets-in-streaming-context-tp6456p6543.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>


Re: Requesting the next InputSplit failed

2016-04-28 Thread Flavio Pompermaier
So what do you suggest to try for the next run?
I was going to increase the Job Manager heap to 3 GB and maybe change some
gc setting.
Do you think I should increase also the akka timeout or other things?

On Thu, Apr 28, 2016 at 2:06 PM, Fabian Hueske  wrote:

> Hmm, 113k splits is quite a lot.
> However, the IF uses the DefaultInputSplitAssigner which is very
> lightweight and should handle a large number of splits well.
>
>
>
> 2016-04-28 13:50 GMT+02:00 Flavio Pompermaier :
>
>> We generate 113k splits because we can't query more than 100k or records
>> per split (and we have to manage 11 billions of records). We tried to run
>> the job only once, before running it the 2nd time we would like to
>> understand which parameter to tune in order to (try to at least to) avoid
>> such an error.
>>
>> Of course I pasted the wrong TM heap size...that is indeed 3Gb (
>> taskmanager.heap.mb:512)
>>
>> Best,
>> Flavio
>>
>> On Thu, Apr 28, 2016 at 1:29 PM, Fabian Hueske  wrote:
>>
>>> Is the problem reproducible?
>>> Maybe the SplitAssigner gets stuck somehow, but I've never observed
>>> something like that.
>>>
>>> How many splits do you generate?
>>>
>>> I guess it is not related, but 512MB for a TM is not a lot on machines
>>> with 16GB RAM.
>>>
>>> 2016-04-28 12:12 GMT+02:00 Flavio Pompermaier :
>>>
 When does this usually happens? Is it because the JobManager has too
 few resources (of some type)?

 Our current configuration of the cluster has 4 machines (with 4 CPUs
 and 16 GB of RAM) and one machine has both a JobManager and a TaskManger
 (the other 3 just a TM).

 Our flink-conf.yml on every machine has the following params:

- jobmanager.heap.mb:512
- taskmanager.heap.mb:512
- taskmanager.numberOfTaskSlots:6
- prallelism.default:24
- env.java.home=/usr/lib/jvm/java-8-oracle/
- taskmanager.network.numberOfBuffers:16384

 The job just read a window of max 100k elements and then writes a
 Tuple5 into a CSV on the jobmanger fs with parallelism 1 (in order to
 produce a single file). The job dies after 40 minutes and hundreds of
 millions of records read.

 Do you see anything sospicious?

 Thanks for the support,
 Flavio

 On Thu, Apr 28, 2016 at 11:54 AM, Fabian Hueske 
 wrote:

> I checked the input format from your PR, but didn't see anything
> suspicious.
>
> It is definitely OK if the processing of an input split tasks more
> than 10 seconds. That should not be the cause.
> It rather looks like the DataSourceTask fails to request a new split
> from the JobManager.
>
> 2016-04-28 9:37 GMT+02:00 Stefano Bortoli :
>
>> Digging the logs, we found this:
>>
>> WARN  Remoting - Tried to associate with unreachable remote address
>> [akka.tcp://flink@127.0.0.1:34984]. Address is now gated for 5000
>> ms, all messages to this address will be delivered to dead letters. 
>> Reason:
>> Connessione rifiutata: /127.0.0.1:34984
>>
>> however, it is not clear why it should refuse a connection to itself
>> after 40min of run. we'll try to figure out possible environment issues.
>> Its a fresh installation, therefore we may have left out some
>> configurations.
>>
>> saluti,
>> Stefano
>>
>> 2016-04-28 9:22 GMT+02:00 Stefano Bortoli :
>>
>>> I had this type of exception when trying to build and test Flink on
>>> a "small machine". I worked around the test increasing the timeout for 
>>> Akka.
>>>
>>>
>>> https://github.com/stefanobortoli/flink/blob/FLINK-1827/flink-tests/src/test/java/org/apache/flink/test/checkpointing/EventTimeAllWindowCheckpointingITCase.java
>>>
>>> it happened only on my machine (a VirtualBox I use for development),
>>> but not on Flavio's. Is it possible that on load situations the 
>>> JobManager
>>> slows down a bit too much?
>>>
>>> saluti,
>>> Stefano
>>>
>>> 2016-04-27 17:50 GMT+02:00 Flavio Pompermaier 
>>> :
>>>
 A precursor of the modified connector (since we started a long time
 ago). However the idea is the same, I compute the inputSplits and then 
 I
 get the data split by split (similarly to what it happens in 
 FLINK-3750 -
 https://github.com/apache/flink/pull/1941 )

 Best,
 Flavio

 On Wed, Apr 27, 2016 at 5:38 PM, Chesnay Schepler <
 ches...@apache.org> wrote:

> Are you using your modified connector or the currently available
> one?
>
>
> On 27.04.2016 17:35, Flavio Pompermaier wrote:
>
> Hi to all,
> I'm running a Flink Job on a JDBC 

Re: Requesting the next InputSplit failed

2016-04-28 Thread Fabian Hueske
Hmm, 113k splits is quite a lot.
However, the IF uses the DefaultInputSplitAssigner which is very
lightweight and should handle a large number of splits well.



2016-04-28 13:50 GMT+02:00 Flavio Pompermaier :

> We generate 113k splits because we can't query more than 100k or records
> per split (and we have to manage 11 billions of records). We tried to run
> the job only once, before running it the 2nd time we would like to
> understand which parameter to tune in order to (try to at least to) avoid
> such an error.
>
> Of course I pasted the wrong TM heap size...that is indeed 3Gb (
> taskmanager.heap.mb:512)
>
> Best,
> Flavio
>
> On Thu, Apr 28, 2016 at 1:29 PM, Fabian Hueske  wrote:
>
>> Is the problem reproducible?
>> Maybe the SplitAssigner gets stuck somehow, but I've never observed
>> something like that.
>>
>> How many splits do you generate?
>>
>> I guess it is not related, but 512MB for a TM is not a lot on machines
>> with 16GB RAM.
>>
>> 2016-04-28 12:12 GMT+02:00 Flavio Pompermaier :
>>
>>> When does this usually happens? Is it because the JobManager has too few
>>> resources (of some type)?
>>>
>>> Our current configuration of the cluster has 4 machines (with 4 CPUs and
>>> 16 GB of RAM) and one machine has both a JobManager and a TaskManger (the
>>> other 3 just a TM).
>>>
>>> Our flink-conf.yml on every machine has the following params:
>>>
>>>- jobmanager.heap.mb:512
>>>- taskmanager.heap.mb:512
>>>- taskmanager.numberOfTaskSlots:6
>>>- prallelism.default:24
>>>- env.java.home=/usr/lib/jvm/java-8-oracle/
>>>- taskmanager.network.numberOfBuffers:16384
>>>
>>> The job just read a window of max 100k elements and then writes a Tuple5
>>> into a CSV on the jobmanger fs with parallelism 1 (in order to produce a
>>> single file). The job dies after 40 minutes and hundreds of millions of
>>> records read.
>>>
>>> Do you see anything sospicious?
>>>
>>> Thanks for the support,
>>> Flavio
>>>
>>> On Thu, Apr 28, 2016 at 11:54 AM, Fabian Hueske 
>>> wrote:
>>>
 I checked the input format from your PR, but didn't see anything
 suspicious.

 It is definitely OK if the processing of an input split tasks more than
 10 seconds. That should not be the cause.
 It rather looks like the DataSourceTask fails to request a new split
 from the JobManager.

 2016-04-28 9:37 GMT+02:00 Stefano Bortoli :

> Digging the logs, we found this:
>
> WARN  Remoting - Tried to associate with unreachable remote address
> [akka.tcp://flink@127.0.0.1:34984]. Address is now gated for 5000 ms,
> all messages to this address will be delivered to dead letters. Reason:
> Connessione rifiutata: /127.0.0.1:34984
>
> however, it is not clear why it should refuse a connection to itself
> after 40min of run. we'll try to figure out possible environment issues.
> Its a fresh installation, therefore we may have left out some
> configurations.
>
> saluti,
> Stefano
>
> 2016-04-28 9:22 GMT+02:00 Stefano Bortoli :
>
>> I had this type of exception when trying to build and test Flink on a
>> "small machine". I worked around the test increasing the timeout for 
>> Akka.
>>
>>
>> https://github.com/stefanobortoli/flink/blob/FLINK-1827/flink-tests/src/test/java/org/apache/flink/test/checkpointing/EventTimeAllWindowCheckpointingITCase.java
>>
>> it happened only on my machine (a VirtualBox I use for development),
>> but not on Flavio's. Is it possible that on load situations the 
>> JobManager
>> slows down a bit too much?
>>
>> saluti,
>> Stefano
>>
>> 2016-04-27 17:50 GMT+02:00 Flavio Pompermaier :
>>
>>> A precursor of the modified connector (since we started a long time
>>> ago). However the idea is the same, I compute the inputSplits and then I
>>> get the data split by split (similarly to what it happens in FLINK-3750 
>>> -
>>> https://github.com/apache/flink/pull/1941 )
>>>
>>> Best,
>>> Flavio
>>>
>>> On Wed, Apr 27, 2016 at 5:38 PM, Chesnay Schepler <
>>> ches...@apache.org> wrote:
>>>
 Are you using your modified connector or the currently available
 one?


 On 27.04.2016 17:35, Flavio Pompermaier wrote:

 Hi to all,
 I'm running a Flink Job on a JDBC datasource and I obtain the
 following exception:

 java.lang.RuntimeException: Requesting the next InputSplit failed.
 at
 org.apache.flink.runtime.taskmanager.TaskInputSplitProvider.getNextInputSplit(TaskInputSplitProvider.java:91)
 at
 org.apache.flink.runtime.operators.DataSourceTask$1.hasNext(DataSourceTask.java:342)
 at
 

Re: Requesting the next InputSplit failed

2016-04-28 Thread Flavio Pompermaier
We generate 113k splits because we can't query more than 100k records
per split (and we have to manage 11 billion records). We have tried to run
the job only once; before running it a 2nd time we would like to
understand which parameter to tune in order to (at least try to) avoid
such an error.

Of course I pasted the wrong TM heap size...that is indeed 3Gb (
taskmanager.heap.mb:512)

Best,
Flavio

On Thu, Apr 28, 2016 at 1:29 PM, Fabian Hueske  wrote:

> Is the problem reproducible?
> Maybe the SplitAssigner gets stuck somehow, but I've never observed
> something like that.
>
> How many splits do you generate?
>
> I guess it is not related, but 512MB for a TM is not a lot on machines
> with 16GB RAM.
>
> 2016-04-28 12:12 GMT+02:00 Flavio Pompermaier :
>
>> When does this usually happens? Is it because the JobManager has too few
>> resources (of some type)?
>>
>> Our current configuration of the cluster has 4 machines (with 4 CPUs and
>> 16 GB of RAM) and one machine has both a JobManager and a TaskManger (the
>> other 3 just a TM).
>>
>> Our flink-conf.yml on every machine has the following params:
>>
>>- jobmanager.heap.mb:512
>>- taskmanager.heap.mb:512
>>- taskmanager.numberOfTaskSlots:6
>>- prallelism.default:24
>>- env.java.home=/usr/lib/jvm/java-8-oracle/
>>- taskmanager.network.numberOfBuffers:16384
>>
>> The job just read a window of max 100k elements and then writes a Tuple5
>> into a CSV on the jobmanger fs with parallelism 1 (in order to produce a
>> single file). The job dies after 40 minutes and hundreds of millions of
>> records read.
>>
>> Do you see anything sospicious?
>>
>> Thanks for the support,
>> Flavio
>>
>> On Thu, Apr 28, 2016 at 11:54 AM, Fabian Hueske 
>> wrote:
>>
>>> I checked the input format from your PR, but didn't see anything
>>> suspicious.
>>>
>>> It is definitely OK if the processing of an input split tasks more than
>>> 10 seconds. That should not be the cause.
>>> It rather looks like the DataSourceTask fails to request a new split
>>> from the JobManager.
>>>
>>> 2016-04-28 9:37 GMT+02:00 Stefano Bortoli :
>>>
 Digging the logs, we found this:

 WARN  Remoting - Tried to associate with unreachable remote address
 [akka.tcp://flink@127.0.0.1:34984]. Address is now gated for 5000 ms,
 all messages to this address will be delivered to dead letters. Reason:
 Connessione rifiutata: /127.0.0.1:34984

 however, it is not clear why it should refuse a connection to itself
 after 40min of run. we'll try to figure out possible environment issues.
 Its a fresh installation, therefore we may have left out some
 configurations.

 saluti,
 Stefano

 2016-04-28 9:22 GMT+02:00 Stefano Bortoli :

> I had this type of exception when trying to build and test Flink on a
> "small machine". I worked around the test increasing the timeout for Akka.
>
>
> https://github.com/stefanobortoli/flink/blob/FLINK-1827/flink-tests/src/test/java/org/apache/flink/test/checkpointing/EventTimeAllWindowCheckpointingITCase.java
>
> it happened only on my machine (a VirtualBox I use for development),
> but not on Flavio's. Is it possible that on load situations the JobManager
> slows down a bit too much?
>
> saluti,
> Stefano
>
> 2016-04-27 17:50 GMT+02:00 Flavio Pompermaier :
>
>> A precursor of the modified connector (since we started a long time
>> ago). However the idea is the same, I compute the inputSplits and then I
>> get the data split by split (similarly to what it happens in FLINK-3750 -
>> https://github.com/apache/flink/pull/1941 )
>>
>> Best,
>> Flavio
>>
>> On Wed, Apr 27, 2016 at 5:38 PM, Chesnay Schepler > > wrote:
>>
>>> Are you using your modified connector or the currently available one?
>>>
>>>
>>> On 27.04.2016 17:35, Flavio Pompermaier wrote:
>>>
>>> Hi to all,
>>> I'm running a Flink Job on a JDBC datasource and I obtain the
>>> following exception:
>>>
>>> java.lang.RuntimeException: Requesting the next InputSplit failed.
>>> at
>>> org.apache.flink.runtime.taskmanager.TaskInputSplitProvider.getNextInputSplit(TaskInputSplitProvider.java:91)
>>> at
>>> org.apache.flink.runtime.operators.DataSourceTask$1.hasNext(DataSourceTask.java:342)
>>> at
>>> org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:137)
>>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>>> at java.lang.Thread.run(Thread.java:745)
>>> Caused by: java.util.concurrent.TimeoutException: Futures timed out
>>> after [1 milliseconds]
>>> at
>>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>>> at
>>> 

Re: Requesting the next InputSplit failed

2016-04-28 Thread Fabian Hueske
Is the problem reproducible?
Maybe the SplitAssigner gets stuck somehow, but I've never observed
something like that.

How many splits do you generate?

I guess it is not related, but 512MB for a TM is not a lot on machines with
16GB RAM.

2016-04-28 12:12 GMT+02:00 Flavio Pompermaier :

> When does this usually happen? Is it because the JobManager has too few
> resources (of some type)?
>
> Our current configuration of the cluster has 4 machines (with 4 CPUs and
> 16 GB of RAM) and one machine has both a JobManager and a TaskManger (the
> other 3 just a TM).
>
> Our flink-conf.yml on every machine has the following params:
>
>- jobmanager.heap.mb:512
>- taskmanager.heap.mb:512
>- taskmanager.numberOfTaskSlots:6
>- prallelism.default:24
>- env.java.home=/usr/lib/jvm/java-8-oracle/
>- taskmanager.network.numberOfBuffers:16384
>
> The job just reads a window of max 100k elements and then writes a Tuple5
> into a CSV on the jobmanager fs with parallelism 1 (in order to produce a
> single file). The job dies after 40 minutes and hundreds of millions of
> records read.
>
> Do you see anything suspicious?
>
> Thanks for the support,
> Flavio
>
> On Thu, Apr 28, 2016 at 11:54 AM, Fabian Hueske  wrote:
>
>> I checked the input format from your PR, but didn't see anything
>> suspicious.
>>
>> It is definitely OK if the processing of an input split tasks more than
>> 10 seconds. That should not be the cause.
>> It rather looks like the DataSourceTask fails to request a new split from
>> the JobManager.
>>
>> 2016-04-28 9:37 GMT+02:00 Stefano Bortoli :
>>
>>> Digging the logs, we found this:
>>>
>>> WARN  Remoting - Tried to associate with unreachable remote address
>>> [akka.tcp://flink@127.0.0.1:34984]. Address is now gated for 5000 ms,
>>> all messages to this address will be delivered to dead letters. Reason:
>>> Connessione rifiutata: /127.0.0.1:34984
>>>
>>> however, it is not clear why it should refuse a connection to itself
>>> after 40min of run. we'll try to figure out possible environment issues.
>>> Its a fresh installation, therefore we may have left out some
>>> configurations.
>>>
>>> saluti,
>>> Stefano
>>>
>>> 2016-04-28 9:22 GMT+02:00 Stefano Bortoli :
>>>
 I had this type of exception when trying to build and test Flink on a
 "small machine". I worked around the test increasing the timeout for Akka.


 https://github.com/stefanobortoli/flink/blob/FLINK-1827/flink-tests/src/test/java/org/apache/flink/test/checkpointing/EventTimeAllWindowCheckpointingITCase.java

 it happened only on my machine (a VirtualBox I use for development),
 but not on Flavio's. Is it possible that on load situations the JobManager
 slows down a bit too much?

 saluti,
 Stefano

 2016-04-27 17:50 GMT+02:00 Flavio Pompermaier :

> A precursor of the modified connector (since we started a long time
> ago). However the idea is the same, I compute the inputSplits and then I
> get the data split by split (similarly to what it happens in FLINK-3750 -
> https://github.com/apache/flink/pull/1941 )
>
> Best,
> Flavio
>
> On Wed, Apr 27, 2016 at 5:38 PM, Chesnay Schepler 
> wrote:
>
>> Are you using your modified connector or the currently available one?
>>
>>
>> On 27.04.2016 17:35, Flavio Pompermaier wrote:
>>
>> Hi to all,
>> I'm running a Flink Job on a JDBC datasource and I obtain the
>> following exception:
>>
>> java.lang.RuntimeException: Requesting the next InputSplit failed.
>> at
>> org.apache.flink.runtime.taskmanager.TaskInputSplitProvider.getNextInputSplit(TaskInputSplitProvider.java:91)
>> at
>> org.apache.flink.runtime.operators.DataSourceTask$1.hasNext(DataSourceTask.java:342)
>> at
>> org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:137)
>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.util.concurrent.TimeoutException: Futures timed out
>> after [1 milliseconds]
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>> at
>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>> at scala.concurrent.Await$.result(package.scala:107)
>> at scala.concurrent.Await.result(package.scala)
>> at
>> org.apache.flink.runtime.taskmanager.TaskInputSplitProvider.getNextInputSplit(TaskInputSplitProvider.java:71)
>> ... 4 more
>>
>> What can be the cause? Is it because the whole DataSource reading has
>> cannot take more than 1 

Re: Failed to stream on Yarn cluster

2016-04-28 Thread patcharee

Hi again,

Actually it works well! I just realized from looking at the Yarn application 
log that the Flink streaming result is printed in taskmanager.out. When 
I sent the question to the mailing list I had only looked at the screen where I 
issued the command, and there was no streaming result there.


Where is taskmanager.out, and how can I change it?

Best,
Patcharee


On 28. april 2016 13:18, Maximilian Michels wrote:

Hi Patcharee,

What do you mean by "nothing happened"? There is no output? Did you
check the logs?

Cheers,
Max

On Thu, Apr 28, 2016 at 12:10 PM, patcharee  wrote:

Hi,

I am testing the streaming wiki example -
https://ci.apache.org/projects/flink/flink-docs-master/quickstart/run_example_quickstart.html

It works fine from local mode (mvn exec:java
-Dexec.mainClass=wikiedits.WikipediaAnalysis). But when I run the jar on
Yarn cluster mode, nothing happened. Any ideas? I tested the word count
example from hdfs file on Yarn cluster and it worked fine.

Best,
Patcharee






Re: General Data questions - streams vs batch

2016-04-28 Thread Fabian Hueske
Hi Konstantin,

If you do not need a deterministic grouping of elements you should not use
a keyed stream or window.
Instead you can do the lookups in a parallel flatMap function. The function
would collect arriving elements and perform a lookup query after a certain
number of elements has arrived (this can cause high latency if the arrival
rate of elements is low or varies).
The flatMap function can be executed in parallel and does not require a
keyed stream.
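
A minimal sketch of what such a batching lookup could look like (the JDBC URL,
table and column names, and the batch size are made up for illustration, and
elements still buffered when the stream ends are not flushed in this sketch):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Buffers incoming keys and enriches them with one JDBC query per batch,
// instead of one query per element. No keyed stream is required.
public class BatchingLookupFunction extends RichFlatMapFunction<String, Tuple2<String, String>> {

    private static final int BATCH_SIZE = 100; // larger batches = fewer queries, more latency
    private transient List<String> buffer;
    private transient Connection connection;

    @Override
    public void open(Configuration parameters) throws Exception {
        buffer = new ArrayList<>();
        connection = DriverManager.getConnection("jdbc:postgresql://db-host/mydb"); // assumed URL
    }

    @Override
    public void flatMap(String key, Collector<Tuple2<String, String>> out) throws Exception {
        buffer.add(key);
        if (buffer.size() >= BATCH_SIZE) {
            flush(out);
        }
    }

    private void flush(Collector<Tuple2<String, String>> out) throws Exception {
        // build one IN (...) query for the whole batch
        StringBuilder sql = new StringBuilder("SELECT key, value FROM lookup_table WHERE key IN (");
        for (int i = 0; i < buffer.size(); i++) {
            sql.append(i == 0 ? "?" : ", ?");
        }
        sql.append(")");
        try (PreparedStatement stmt = connection.prepareStatement(sql.toString())) {
            for (int i = 0; i < buffer.size(); i++) {
                stmt.setString(i + 1, buffer.get(i));
            }
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    out.collect(Tuple2.of(rs.getString("key"), rs.getString("value")));
                }
            }
        }
        buffer.clear();
    }

    @Override
    public void close() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }
}

It would be applied as stream.flatMap(new BatchingLookupFunction()) with
whatever parallelism is needed.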

Best, Fabian


2016-04-25 18:58 GMT+02:00 Konstantin Kulagin :

> As usual - thanks for answers, Aljoscha!
>
> I think I understood what I want to know.
>
> 1) To add few comments: about streams I was thinking about something
> similar to Storm where you can have one Source and 'duplicate' the same
> entry going through different 'path's.
> Something like this:
> https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_storm-user-guide/content/figures/1/figures/SpoutsAndBolts.png
> And later you can 'join' these separate streams back.
> And actually I think this is what I meant:
> https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/datastream/JoinedStreams.html
> - this one actually 'joins' by window.
>
> As for 'exact-once-guarantee' I've got the difference from this paper:
> http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink
> - Thanks!
>
> 2) understood, thank you very much
>
>
>
>
>
>
> I'll probably bother you one more time with another question:
>
> 3) Lets say I have a Source which provides raw (i.e. non-keyed) data. And
> lets say I need to 'enhance' each entry with some fields which I can take
> from a database.
> So I define some DbEnhanceOperation
>
> Database query might be expensive - so I would want to
> a) batch entries to perform queries
> b) be able to have several parallel DbEnhaceOperations so those will not
> slow down my whole processing.
>
>
> I do not see a way to do that?
>
>
> Problems:
>
> I cannot go with countWindowAll because of b) - that thing does not
> support several streams (correct?)
>
> So I need to create a windowed stream and for that I need to have some key
> - Correct? I.e cannot create windows on a stream of general object just
> using number of objects.
>
> I probably can 'emulate' keyed stream by providing some 'fake' key. But in
> this case I can parallelize only on different keys. Again - it is probably
> doable by introducing some AtomicLong key generator at the first place (
> this part probably hard to understand - I can share details if necessary)
> but still looks like a bit of hack :)
>
> But the general question - if I can implement 3) 'normally' in a flink-way?
>
> Thanks!
> Konstantin.
>
>
>
>
>
>
>
>
>
>
>
> On Mon, Apr 25, 2016 at 10:53 AM, Aljoscha Krettek 
> wrote:
>
>> Hi,
>> I'll try and answer your questions separately. First, a general remark,
>> although Flink has the DataSet API for batch processing and the DataStream
>> API for stream processing we only have one underlying streaming execution
>> engine that is used for both. Now, regarding the questions:
>>
>> 1) What do you mean by "parallel into 2 streams"? Maybe that could
>> influence my answer but I'll just give a general answer: Flink does not
>> give any guarantees about the ordering of elements in a Stream or in a
>> DataSet. This means that merging or unioning two streams/data sets will
>> just mean that operations see all elements in the two merged streams but
>> the order in which we see them is arbitrary. This means that we don't keep
>> buffers based on time or size or anything.
>>
>> 2) The elements that flow through the topology are not tracked
>> individually, each operation just receives elements, updates state and
>> sends elements to downstream operation. In essence this means that elements
>> themselves don't block any resources except if they alter some kept state
>> in operations. If you have a stateless pipeline that only has
>> filters/maps/flatMaps then the amount of required resources is very low.
>>
>> For a finite data set, elements are also streamed through the topology.
>> Only if you use operations that require grouping or sorting (such as
>> groupBy/reduce and join) will elements be buffered in memory or on disk
>> before they are processed.
>>
>> To answer your last question. If you only do stateless
>> transformations/filters then you are fine to use either API and the
>> performance should be similar.
>>
>> Cheers,
>> Aljoscha
>>
>> On Sun, 24 Apr 2016 at 15:54 Konstantin Kulagin 
>> wrote:
>>
>>> Hi guys,
>>>
>>> I have some kind of general question in order to get more understanding
>>> of stream vs final data transformation. More specific - I am trying to
>>> understand 'entities' lifecycle during processing.
>>>
>>> 1) For example in a case of streams: suppose we start with some
>>> key-value source, parallel it into 2 streams by key. Each 

Re: WindowedStream vs AllWindowedStream

2016-04-28 Thread Aljoscha Krettek
They are different classes because the signature of their apply method is
different. If one were the subclass, it would be possible to call apply
with the wrong signature.
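
For illustration only (written against recent Flink versions; the counting
logic and types are placeholders), this is roughly how the two apply variants
are used, which is also why neither class can simply extend the other:

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.AllWindowFunction;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class WindowApplySignatures {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<String, Long>> stream =
                env.fromElements(Tuple2.of("a", 1L), Tuple2.of("b", 2L));

        // WindowedStream (keyed): the WindowFunction's apply() also receives the key.
        stream.keyBy(new KeySelector<Tuple2<String, Long>, String>() {
                  @Override
                  public String getKey(Tuple2<String, Long> value) { return value.f0; }
              })
              .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
              .apply(new WindowFunction<Tuple2<String, Long>, Long, String, TimeWindow>() {
                  @Override
                  public void apply(String key, TimeWindow window,
                                    Iterable<Tuple2<String, Long>> input, Collector<Long> out) {
                      long count = 0;
                      for (Tuple2<String, Long> ignored : input) { count++; }
                      out.collect(count); // one count per key and window
                  }
              }).print();

        // AllWindowedStream (non-keyed): the AllWindowFunction's apply() has no key argument.
        stream.windowAll(TumblingProcessingTimeWindows.of(Time.minutes(1)))
              .apply(new AllWindowFunction<Tuple2<String, Long>, Long, TimeWindow>() {
                  @Override
                  public void apply(TimeWindow window,
                                    Iterable<Tuple2<String, Long>> input, Collector<Long> out) {
                      long count = 0;
                      for (Tuple2<String, Long> ignored : input) { count++; }
                      out.collect(count); // one count for the whole window
                  }
              }).print();

        env.execute("window apply signatures");
    }
}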

On Thu, 28 Apr 2016 at 12:25 Radu Prodan  wrote:

> Hi all,
>
> I have question about the differences between WindowedStream and
> AllWindowedStream. From the definition I see that WindowedStream are
> partitioned based on key but for AllWindowedStream this is not the case.
>
> So, what comes to my mind is, why WindowedStream is not the special case
> of AllWindowedStream? or why those classes are completely separated, and
> not extending from one another?
>
> Radu
>


Re: Failed to stream on Yarn cluster

2016-04-28 Thread Maximilian Michels
Hi Patcharee,

What do you mean by "nothing happened"? There is no output? Did you
check the logs?

Cheers,
Max

On Thu, Apr 28, 2016 at 12:10 PM, patcharee  wrote:
> Hi,
>
> I am testing the streaming wiki example -
> https://ci.apache.org/projects/flink/flink-docs-master/quickstart/run_example_quickstart.html
>
> It works fine from local mode (mvn exec:java
> -Dexec.mainClass=wikiedits.WikipediaAnalysis). But when I run the jar on
> Yarn cluster mode, nothing happened. Any ideas? I tested the word count
> example from hdfs file on Yarn cluster and it worked fine.
>
> Best,
> Patcharee
>
>


WindowedStream vs AllWindowedStream

2016-04-28 Thread Radu Prodan
Hi all,

I have a question about the differences between WindowedStream and
AllWindowedStream. From the definition I see that a WindowedStream is
partitioned based on a key, but for an AllWindowedStream this is not the case.

So, what comes to my mind is: why is WindowedStream not a special case of
AllWindowedStream? Or why are those classes completely separate, rather than
one extending the other?

Radu


Failed to stream on Yarn cluster

2016-04-28 Thread patcharee

Hi,

I am testing the streaming wiki example - 
https://ci.apache.org/projects/flink/flink-docs-master/quickstart/run_example_quickstart.html


It works fine in local mode (mvn exec:java 
-Dexec.mainClass=wikiedits.WikipediaAnalysis). But when I run the jar in 
Yarn cluster mode, nothing happens. Any ideas? I tested the word count 
example reading from an hdfs file on the Yarn cluster and it worked fine.


Best,
Patcharee




Re: Configuring a RichFunction on a DataStream

2016-04-28 Thread Fabian Hueske
Hi Robert,

Function configuration via a Configuration object and the open method is an
artifact from the past.
The recommended way is to configure the function object via the
constructor.
Flink serializes the function object and ships it to the workers for
execution. So the state of a function is preserved. Note, the function
object must be serializable with Java serialization.
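
A minimal sketch of the constructor approach (class and field names are just
examples):

import org.apache.flink.api.common.functions.FilterFunction;

// The threshold is passed via the constructor and kept in a field; Flink
// serializes the whole function object and ships it to the workers, so the
// value is available there. Fields must be Java-serializable.
public class ThresholdFilter implements FilterFunction<Long> {

    private final long threshold;

    public ThresholdFilter(long threshold) {
        this.threshold = threshold;
    }

    @Override
    public boolean filter(Long value) {
        return value > threshold;
    }
}

On a DataStream<Long> called numbers this would be used as
numbers.filter(new ThresholdFilter(100L)); the same pattern works for
Rich*Functions, whose open() then remains available for creating
non-serializable resources such as connections.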

Best, Fabian

2016-04-28 10:33 GMT+02:00 Robert Schmidtke :

> Hi everyone,
>
> I noticed that in the DataSet API, there is the .withParameters function
> that allows passing values to a RichFunction's open method. I was wondering
> whether a similar approach can be used to the same thing in a DataStream.
> Right now I'm getting the parameters via getRuntimeContext, but it feels a
> little awkward given the available Configuration object.
>
> Thanks!
>
> Robert
>
>
> --
> My GPG Key ID: 336E2680
>


Re: Requesting the next InputSplit failed

2016-04-28 Thread Fabian Hueske
I checked the input format from your PR, but didn't see anything
suspicious.

It is definitely OK if the processing of an input split takes more than 10
seconds. That should not be the cause.
It rather looks like the DataSourceTask fails to request a new split from
the JobManager.

2016-04-28 9:37 GMT+02:00 Stefano Bortoli :

> Digging the logs, we found this:
>
> WARN  Remoting - Tried to associate with unreachable remote address
> [akka.tcp://flink@127.0.0.1:34984]. Address is now gated for 5000 ms, all
> messages to this address will be delivered to dead letters. Reason:
> Connessione rifiutata: /127.0.0.1:34984
>
> however, it is not clear why it should refuse a connection to itself after
> 40min of run. we'll try to figure out possible environment issues. Its a
> fresh installation, therefore we may have left out some configurations.
>
> saluti,
> Stefano
>
> 2016-04-28 9:22 GMT+02:00 Stefano Bortoli :
>
>> I had this type of exception when trying to build and test Flink on a
>> "small machine". I worked around the test increasing the timeout for Akka.
>>
>>
>> https://github.com/stefanobortoli/flink/blob/FLINK-1827/flink-tests/src/test/java/org/apache/flink/test/checkpointing/EventTimeAllWindowCheckpointingITCase.java
>>
>> it happened only on my machine (a VirtualBox I use for development), but
>> not on Flavio's. Is it possible that on load situations the JobManager
>> slows down a bit too much?
>>
>> saluti,
>> Stefano
>>
>> 2016-04-27 17:50 GMT+02:00 Flavio Pompermaier :
>>
>>> A precursor of the modified connector (since we started a long time
>>> ago). However the idea is the same, I compute the inputSplits and then I
>>> get the data split by split (similarly to what it happens in FLINK-3750 -
>>> https://github.com/apache/flink/pull/1941 )
>>>
>>> Best,
>>> Flavio
>>>
>>> On Wed, Apr 27, 2016 at 5:38 PM, Chesnay Schepler 
>>> wrote:
>>>
 Are you using your modified connector or the currently available one?


 On 27.04.2016 17:35, Flavio Pompermaier wrote:

 Hi to all,
 I'm running a Flink Job on a JDBC datasource and I obtain the following
 exception:

 java.lang.RuntimeException: Requesting the next InputSplit failed.
 at
 org.apache.flink.runtime.taskmanager.TaskInputSplitProvider.getNextInputSplit(TaskInputSplitProvider.java:91)
 at
 org.apache.flink.runtime.operators.DataSourceTask$1.hasNext(DataSourceTask.java:342)
 at
 org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:137)
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.util.concurrent.TimeoutException: Futures timed out
 after [1 milliseconds]
 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at scala.concurrent.Await.result(package.scala)
 at
 org.apache.flink.runtime.taskmanager.TaskInputSplitProvider.getNextInputSplit(TaskInputSplitProvider.java:71)
 ... 4 more

 What can be the cause? Is it because the whole DataSource reading has
 cannot take more than 1 milliseconds?

 Best,
 Flavio



>>>
>>>
>>
>


Re: About flink stream table API

2016-04-28 Thread Fabian Hueske
Hi,

Table API and SQL for streaming are work in progress. A first version which
supports projection, filter, and union is merged to the master branch.
Under the hood, Flink uses Calcite to optimize and translate Table API and
SQL queries.

Best, Fabian

2016-04-27 14:27 GMT+02:00 Zhangrucong :

> Hello everybody:
>
> I want to learn the Flink stream API. Is the streaming SQL the same
> as Calcite's?
>
> In the following link, the Table API examples use DataSets. Where can I
> see a detailed introduction to the streaming Table API?
>
>
> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/table.html
>
>
>
>   Could somebody help me?
>
>  Thanks in advance!
>
>
>
>
>


Re: Multiple windows with large number of partitions

2016-04-28 Thread Aljoscha Krettek
Hi,
is there a reason for keying on both the "date only" field and the
"userid"? I think you should be fine by just specifying that you want 1-day
windows on your timestamps.

Also, do you have a timestamp extractor in place that takes the timestamp
from your data and sets it as the internal timestamp field? This is
explained in more detail here:
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/event_timestamps_watermarks.html#timestamp-assigners--watermark-generators
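
A minimal sketch of that suggestion (assuming the elements are a Tuple3 with a
Joda DateTime timestamp in field 0 and the userid in field 2, and that
timestamps and watermarks are already assigned upstream; names are
illustrative):

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.joda.time.DateTime;

public class DailyUniqueUsersSketch {

    // Keeps a single record per user and per 1-day event-time window.
    public static DataStream<Tuple3<DateTime, String, String>> onePerUserAndDay(
            DataStream<Tuple3<DateTime, String, String>> logins) {

        return logins
                .keyBy(2)                      // key by the userid field only
                .timeWindow(Time.days(1))      // 1-day event-time windows
                .reduce(new ReduceFunction<Tuple3<DateTime, String, String>>() {
                    @Override
                    public Tuple3<DateTime, String, String> reduce(
                            Tuple3<DateTime, String, String> first,
                            Tuple3<DateTime, String, String> second) {
                        return first;
                    }
                });
    }
}

The distinct count per day could then be derived downstream, for example with
a timeWindowAll over the reduced stream, without also keying every window on
the date.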

Cheers,
Aljoscha

On Thu, 28 Apr 2016 at 06:04 Christopher Santiago 
wrote:

> I've been working through the flink demo applications and started in on a
> prototype, but have run into an issue with how to approach the problem of
> getting a daily unique user count from a traffic stream.  I'm using a time
> characteristic event time.
>
> Sample event stream
> (timestamp,userid):
> 2015-12-02T01:13:21.002Z,bc030136a91aa46eb436dcb28fa72fed
> 2015-12-02T01:13:21.003Z,bc030136a91aa46eb436dcb28fa72fed
> 2015-12-02T01:37:48.003Z,bc030136a91aa46eb436dcb28fa72fed
> 2015-12-02T01:37:48.004Z,bc030136a91aa46eb436dcb28fa72fed
> 2015-12-02T02:02:15.004Z,bc030136a91aa46eb436dcb28fa72fed
> 2015-12-02T00:00:00.000Z,5dd63d9756a975d0f4be6a6856005381
> 2015-12-02T00:16:58.000Z,5dd63d9756a975d0f4be6a6856005381
> 2015-12-02T00:00:00.000Z,ccd72e4535c92c499bb66eea6f4f9aab
> 2015-12-02T00:14:56.000Z,ccd72e4535c92c499bb66eea6f4f9aab
> 2015-12-02T00:14:56.001Z,ccd72e4535c92c499bb66eea6f4f9aab
> 2015-12-02T00:29:52.001Z,ccd72e4535c92c499bb66eea6f4f9aab
> 2015-12-02T00:29:52.002Z,ccd72e4535c92c499bb66eea6f4f9aab
> 2015-12-02T00:44:48.002Z,ccd72e4535c92c499bb66eea6f4f9aab
>
> Requirements:
> 1.  Get a count of the unique users by day.
> 2.  Early results should be emitted as quickly as possible.  (I've been
> trying to use 30/60 seconds windows)
> 3.  Events are accepted up to 2 days late.
>
> I've used the following as guidance:
>
> EventTimeTriggerWithEarlyAndLateFiring
>
> https://raw.githubusercontent.com/kl0u/flink-examples/master/src/main/java/com/dataartisans/flinksolo/beam_comparison/customTriggers/EventTimeTriggerWithEarlyAndLateFiring.java
>
> Multi-window transformations
>
> http://mail-archives.apache.org/mod_mbox/flink-user/201604.mbox/%3CBAY182-W870B521BDC5709973990D5ED9A0%40phx.gbl%3E
>
> Summing over aggregate windows
>
> http://stackoverflow.com/questions/36791816/how-to-declare-1-minute-tumbling-window
>
> but I can't get the reduce/aggregation logic working correctly.  Here's a
> sample of how I have the windows setup with a datastream of tuple3 with
> timestamp, date only, userid:
>
> DataStream<Tuple3<DateTime, String, String>> uniqueLogins = logins
>     .keyBy(1,2)
>     .timeWindow(Time.days(1))
>     .trigger(EventTimeNoLog.create()  // Same as EventTimeTriggerWithEarlyAndLateFiring,
>                                       // just modified logging since it is potentially 2-3x per event read in
>         .withEarlyFiringEvery(Time.seconds(60))
>         .withLateFiringEvery(Time.seconds(600))
>         .withAllowedLateness(Time.days(2)))
>     // Reduce to earliest timestamp for a given day for a user
>     .reduce(new ReduceFunction<Tuple3<DateTime, String, String>>() {
>         public Tuple3<DateTime, String, String> reduce(Tuple3<DateTime, String, String> event1,
>                 Tuple3<DateTime, String, String> event2) {
>             return event1;
>         }
>     });
>
> SingleOutputStreamOperator<Tuple2<String, Long>> window = uniqueLogins
>     .timeWindowAll(Time.days(1))
>     .trigger(EventTimeTriggerWithEarlyAndLateFiring.create()
>         .withEarlyFiringEvery(Time.seconds(60))
>         .withLateFiringEvery(Time.seconds(600))
>         .withAllowedLateness(Time.days(2))
>         .aggregator())  // Modified EventTimeTriggerWithEarlyAndLateFiring that does a
>                         // fire_and_purge on onProcessingTime when aggregator is set
>     // Manually count
>     .apply(new AllWindowFunction<Tuple3<DateTime, String, String>, Tuple2<String, Long>, TimeWindow>() {
>         @Override
>         public void apply(TimeWindow window, Iterable<Tuple3<DateTime, String, String>> input,
>                 Collector<Tuple2<String, Long>> out) throws Exception {
>             int count = 0;
>             String windowTime = null;
>
>             for (Tuple3<DateTime, String, String> login : input) {
>                 windowTime = login.f1;
>                 count++;
>             }
>             out.collect(new Tuple2(windowTime, new Long(count)));
>         }
>     });
>
> From the logging I've put in place, it seems that there is a performance
> issue with the first keyBy, where there is now a unique window for each
> date/user combination (in my sample data around 500k windows). When
> reducing, results are not emitted at a constant enough rate for the second
> window to perform its aggregation at a schedulable interval.  Is there a
> better approach to performing this type of calculation directly in Flink?
>


RE: classpath issue on yarn

2016-04-28 Thread aris kol
So, I shaded guava. The whole thing works fine locally (standalone local flink),
but on yarn (forgot to mention it runs on EMR), I get the following:

org.apache.flink.client.program.ProgramInvocationException: Unknown I/O error while extracting contained jar files.
	at org.apache.flink.client.program.PackagedProgram.extractContainedLibaries(PackagedProgram.java:729)
	at org.apache.flink.client.program.PackagedProgram.<init>(PackagedProgram.java:192)
	at org.apache.flink.client.CliFrontend.buildProgram(CliFrontend.java:922)
	at org.apache.flink.client.CliFrontend.run(CliFrontend.java:301)
	at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1192)
	at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1243)
Caused by: java.util.zip.ZipException: error in opening zip file
	at java.util.zip.ZipFile.open(Native Method)
	at java.util.zip.ZipFile.<init>(ZipFile.java:219)
	at java.util.zip.ZipFile.<init>(ZipFile.java:149)
	at java.util.jar.JarFile.<init>(JarFile.java:166)
	at java.util.jar.JarFile.<init>(JarFile.java:130)
	at org.apache.flink.client.program.PackagedProgram.extractContainedLibaries(PackagedProgram.java:647)
	... 5 more

I removed the shaded dependency and I got back to the previous error. Any clues?

Thanks,
Aris
From: gizera...@hotmail.com
To: user@flink.apache.org
Subject: RE: classpath issue on yarn
Date: Tue, 26 Apr 2016 21:03:50 +




Hi Robert,

Thank you for your prompt response. No, I downloaded it from an apache mirror.
I think yarn loads the hadoop universe before the user classpath by default, so
I reckon I would get this exception even without flink in the middle. I can
still see both the old and the new MoreExecutors class in flink-dist (the old
as org/apache/flink/hadoop/shaded/com/google/common/util/concurrent, the new as
org/apache/flink/shaded/com/google/common/util/concurrent). I reckon I should
try to shade guava on my side, but the Shade plugin in sbt-assembly is quite
fresh. I will try and report.

Thanks,
Aris


From: rmetz...@apache.org
Date: Tue, 26 Apr 2016 18:42:31 +0200
Subject: Re: classpath issue on yarn
To: user@flink.apache.org

Hi Aris,

Did you build the 1.0.2 flink-dist yourself? If not, which exact version did you
download? For example this file:
http://www.apache.org/dyn/closer.lua/flink/flink-1.0.2/flink-1.0.2-bin-hadoop2-scala_2.11.tgz
has a clean flink-dist jar.


On Tue, Apr 26, 2016 at 12:28 PM, aris kol  wrote:



Hi guys,

I ran into a weird classpath issue while running a streaming job on a yarn
cluster. I have a relatively simple flow that reads data from kafka, does a
few manipulations and then indexes them on Elasticsearch (2.3). I use the
elasticsearch2 connector (1.1-SNAPSHOT) (bleeding edge, I know). The stream
works fine in a local flink node (1.0.2) (reading from remote kafka and writing
to remote es). However, when deployed to the remote YARN cluster (again, flink
1.0.2) the following exception is thrown:

```
04/26/2016 10:07:30 Source: Custom Source -> Flat Map -> Sink: Unnamed(1/8) switched to FAILED
java.lang.NoSuchMethodError: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
	at org.elasticsearch.threadpool.ThreadPool.<init>(ThreadPool.java:190)
	at org.elasticsearch.client.transport.TransportClient$Builder.build(TransportClient.java:131)
	at org.apache.flink.streaming.connectors.elasticsearch2.ElasticsearchSink.open(ElasticsearchSink.java:164)
	at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:38)
	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:91)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:317)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:215)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
	at java.lang.Thread.run(Thread.java:745)

04/26/2016 10:07:30 Job execution switched to status FAILING.
java.lang.NoSuchMethodError: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
	at org.elasticsearch.threadpool.ThreadPool.<init>(ThreadPool.java:190)
	at org.elasticsearch.client.transport.TransportClient$Builder.build(TransportClient.java:131)
	at org.apache.flink.streaming.connectors.elasticsearch2.ElasticsearchSink.open(ElasticsearchSink.java:164)
	at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:38)
	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:91)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:317)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:215)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
	at

Create a cluster inside Flink

2016-04-28 Thread Simone Robutti
Hello everyone,

I'm approaching a rather big and complex integration with an existing
software and I would like to hear the opinion of more experienced users on
how to tackle a few issues.

This software builds a cloud with its own logic. What I need is to keep
these nodes as instances inside the TaskManagers and use these instances to
perform operations with dedicated operators. I need to move tabular data
back and forth from and to Flink's DataSets and be able to call methods on
these instances.

I would like to receive suggestions on how to implement this behaviour.
First I thought about using Flink's actor system but I discovered it is not
accessible. So I would like to understand how to properly create these
instances (new thread inside a mapPartition?), how to call methods on them
(create a custom context?) and convert data from a Dataset or a Table to
the custom format of this software (this probably won't be much of a
problem, I will write wrappers or at worst replicate the data).
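
One possible shape for the per-task instance part, sketched here only as a
starting point (the Object placeholder and the commented-out calls stand in
for the external software's API, which is not known here):

import org.apache.flink.api.common.functions.RichMapPartitionFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Pattern sketch: one long-lived instance of the external software per parallel
// task, created in open(), used for the whole partition, torn down in close().
public class ExternalNodeMapPartition extends RichMapPartitionFunction<String, String> {

    private transient Object externalNode; // placeholder for the external node handle

    @Override
    public void open(Configuration parameters) throws Exception {
        // start or connect to the external instance once per task, e.g.
        // externalNode = TheirNodeFactory.start(getRuntimeContext().getIndexOfThisSubtask());
        externalNode = new Object();
    }

    @Override
    public void mapPartition(Iterable<String> values, Collector<String> out) throws Exception {
        for (String value : values) {
            // hand each record to the external instance and emit its result, e.g.
            // out.collect(externalNode.process(value));
            out.collect(value);
        }
    }

    @Override
    public void close() throws Exception {
        // shut the external instance down here, e.g. externalNode.shutdown();
        externalNode = null;
    }
}

Converting between DataSet rows and the software's own format would then happen
inside such a function, or in plain map functions around it.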

Any suggestion is welcome.

Thanks,

Simone


Re: Reducing parallelism leads to NoResourceAvailableException

2016-04-28 Thread Ufuk Celebi
Hey Ken!

That should not happen. Can you check the web interface for two things:

- How many available slots are advertised on the landing page
(localhost:8081) when you submit your job?
- Can you check the actual parallelism of the submitted job (it should
appear as a FAILED job in the web frontend). Is it really 15?

– Ufuk

On Thu, Apr 28, 2016 at 12:52 AM, Ken Krugler
 wrote:
> Hi all,
>
> In trying out different settings for performance, I run into a job failure
> case that puzzles me.
>
> I’d done a run with a parallelism of 20 (-p 20 via CLI), and the job ran
> successfully, on a cluster with 40 slots.
>
> I then tried with -p 15, and it failed with:
>
> NoResourceAvailableException: Not enough free slots available to run the
> job. You can decrease the operator parallelism…
>
> But the change was to reduce parallelism - why would that now cause this
> problem?
>
> Thanks,
>
> — Ken
>
>
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>


Configuring a RichFunction on a DataStream

2016-04-28 Thread Robert Schmidtke
Hi everyone,

I noticed that in the DataSet API, there is the .withParameters function
that allows passing values to a RichFunction's open method. I was wondering
whether a similar approach can be used to do the same thing in a DataStream.
Right now I'm getting the parameters via getRuntimeContext, but it feels a
little awkward given the available Configuration object.

Thanks!

Robert


-- 
My GPG Key ID: 336E2680


Re: Reducing parallelism leads to NoResourceAvailableException

2016-04-28 Thread Aljoscha Krettek
Hi,
is this a streaming or batch job? If it is a batch job, are you using
either collect() or print() on a DataSet?

Cheers,
Aljoscha

On Thu, 28 Apr 2016 at 00:52 Ken Krugler 
wrote:

> Hi all,
>
> In trying out different settings for performance, I run into a job failure
> case that puzzles me.
>
> I’d done a run with a parallelism of 20 (-p 20 via CLI), and the job ran
> successfully, on a cluster with 40 slots.
>
> I then tried with -p 15, and it failed with:
>
> NoResourceAvailableException: Not enough free slots available to run the
> job. You can decrease the operator parallelism…
>
> But the change was to reduce parallelism - why would that now cause this
> problem?
>
> Thanks,
>
> — Ken
>
>
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
>


Re: Requesting the next InputSplit failed

2016-04-28 Thread Stefano Bortoli
Digging through the logs, we found this:

WARN  Remoting - Tried to associate with unreachable remote address
[akka.tcp://flink@127.0.0.1:34984]. Address is now gated for 5000 ms, all
messages to this address will be delivered to dead letters. Reason:
Connessione rifiutata (Connection refused): /127.0.0.1:34984

However, it is not clear why it should refuse a connection to itself after
40 minutes of running. We'll try to figure out possible environment issues.
It's a fresh installation, so we may have left out some configuration.

saluti,
Stefano

2016-04-28 9:22 GMT+02:00 Stefano Bortoli :

> I had this type of exception when trying to build and test Flink on a
> "small machine". I worked around the test increasing the timeout for Akka.
>
>
> https://github.com/stefanobortoli/flink/blob/FLINK-1827/flink-tests/src/test/java/org/apache/flink/test/checkpointing/EventTimeAllWindowCheckpointingITCase.java
>
> it happened only on my machine (a VirtualBox I use for development), but
> not on Flavio's. Is it possible that on load situations the JobManager
> slows down a bit too much?
>
> saluti,
> Stefano
>
> 2016-04-27 17:50 GMT+02:00 Flavio Pompermaier :
>
>> A precursor of the modified connector (since we started a long time ago).
>> However the idea is the same, I compute the inputSplits and then I get the
>> data split by split (similarly to what it happens in FLINK-3750 -
>> https://github.com/apache/flink/pull/1941 )
>>
>> Best,
>> Flavio
>>
>> On Wed, Apr 27, 2016 at 5:38 PM, Chesnay Schepler 
>> wrote:
>>
>>> Are you using your modified connector or the currently available one?
>>>
>>>
>>> On 27.04.2016 17:35, Flavio Pompermaier wrote:
>>>
>>> Hi to all,
>>> I'm running a Flink Job on a JDBC datasource and I obtain the following
>>> exception:
>>>
>>> java.lang.RuntimeException: Requesting the next InputSplit failed.
>>> at
>>> org.apache.flink.runtime.taskmanager.TaskInputSplitProvider.getNextInputSplit(TaskInputSplitProvider.java:91)
>>> at
>>> org.apache.flink.runtime.operators.DataSourceTask$1.hasNext(DataSourceTask.java:342)
>>> at
>>> org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:137)
>>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>>> at java.lang.Thread.run(Thread.java:745)
>>> Caused by: java.util.concurrent.TimeoutException: Futures timed out
>>> after [1 milliseconds]
>>> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>>> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>>> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>>> at
>>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>>> at scala.concurrent.Await$.result(package.scala:107)
>>> at scala.concurrent.Await.result(package.scala)
>>> at
>>> org.apache.flink.runtime.taskmanager.TaskInputSplitProvider.getNextInputSplit(TaskInputSplitProvider.java:71)
>>> ... 4 more
>>>
>>> What can be the cause? Is it because the whole DataSource reading has
>>> cannot take more than 1 milliseconds?
>>>
>>> Best,
>>> Flavio
>>>
>>>
>>>
>>
>>
>


Re: Requesting the next InputSplit failed

2016-04-28 Thread Stefano Bortoli
I had this type of exception when trying to build and test Flink on a
"small machine". I worked around it in the test by increasing the timeout for Akka.

https://github.com/stefanobortoli/flink/blob/FLINK-1827/flink-tests/src/test/java/org/apache/flink/test/checkpointing/EventTimeAllWindowCheckpointingITCase.java
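
For a deployed cluster (rather than the test above) the corresponding knob
would presumably be the Akka ask timeout in flink-conf.yaml, e.g.:

# raise the timeout used for Akka asks such as input split requests
# (default is 10 s; the value below is only an example)
akka.ask.timeout: 60 s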

It happened only on my machine (a VirtualBox I use for development), but
not on Flavio's. Is it possible that under load the JobManager
slows down a bit too much?

saluti,
Stefano

2016-04-27 17:50 GMT+02:00 Flavio Pompermaier :

> A precursor of the modified connector (since we started a long time ago).
> However the idea is the same, I compute the inputSplits and then I get the
> data split by split (similarly to what it happens in FLINK-3750 -
> https://github.com/apache/flink/pull/1941 )
>
> Best,
> Flavio
>
> On Wed, Apr 27, 2016 at 5:38 PM, Chesnay Schepler 
> wrote:
>
>> Are you using your modified connector or the currently available one?
>>
>>
>> On 27.04.2016 17:35, Flavio Pompermaier wrote:
>>
>> Hi to all,
>> I'm running a Flink Job on a JDBC datasource and I obtain the following
>> exception:
>>
>> java.lang.RuntimeException: Requesting the next InputSplit failed.
>> at
>> org.apache.flink.runtime.taskmanager.TaskInputSplitProvider.getNextInputSplit(TaskInputSplitProvider.java:91)
>> at
>> org.apache.flink.runtime.operators.DataSourceTask$1.hasNext(DataSourceTask.java:342)
>> at
>> org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:137)
>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
>> [1 milliseconds]
>> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>> at
>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>> at scala.concurrent.Await$.result(package.scala:107)
>> at scala.concurrent.Await.result(package.scala)
>> at
>> org.apache.flink.runtime.taskmanager.TaskInputSplitProvider.getNextInputSplit(TaskInputSplitProvider.java:71)
>> ... 4 more
>>
>> What can be the cause? Is it because the whole DataSource reading has
>> cannot take more than 1 milliseconds?
>>
>> Best,
>> Flavio
>>
>>
>>
>
>